Computer Engineering and Applications, 2024, Vol. 60, Issue (11): 84-94. DOI: 10.3778/j.issn.1002-8331.2302-0035

• Theory, Research and Development •

Multi-modal Object Tracking Algorithm Using Transformer

LIU Wanjun, LIANG Linlin, QU Haicheng

  1. College of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
  • Online: 2024-06-01  Published: 2024-05-31

Trans-RGBT: RGBT Object Tracking with Transformer

LIU Wanjun, LIANG Linlin, QU Haicheng   

  1. College of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
  • Online: 2024-06-01  Published: 2024-05-31

Abstract: Most current object tracking methods make localization decisions by fusing information from different modalities, and they suffer from insufficient information extraction, overly simple fusion schemes, and an inability to track targets accurately in low-light scenes. To address this, a Transformer-based multi-modal object tracking algorithm (Trans-RGBT) is proposed. A pseudo-Siamese network extracts features from the visible-light image and the infrared image separately, and the two are fully fused at the feature level. The target information from the first frame is modulated into the feature vectors of the frame to be tracked, yielding a tracker dedicated to that target. A Transformer is then used to encode and decode the target in the field of view: a spatial-position prediction branch predicts the target's spatial position, and historical information is used to filter out distractors, giving the target's accurate location. Finally, a bounding-box regression network predicts the target's enclosing rectangle, achieving accurate tracking. Experiments on the latest large-scale datasets VTUAV and RGBT234 show that, compared with Siamese-based and filter-based algorithms, Trans-RGBT is more accurate and more robust, and runs at a near real-time speed of 22 FPS.
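As a rough illustration of the dual-branch front end described above, the sketch below shows a pseudo-Siamese two-branch extractor for the visible-light and infrared inputs with feature-level fusion. The backbone depth, channel widths, and the 1x1-convolution fusion are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumptions, not the authors' code): a pseudo-Siamese
# two-branch extractor for RGB and thermal-infrared inputs, fused at the
# feature level. Backbone depth and channel sizes are illustrative only.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # simple conv-BN-ReLU stage standing in for a real backbone stage
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class PseudoSiameseFusion(nn.Module):
    """Two branches with identical structure but separate weights
    (pseudo-Siamese), followed by feature-level fusion."""
    def __init__(self, dim=256):
        super().__init__()
        self.rgb_branch = nn.Sequential(conv_block(3, 64), conv_block(64, dim))
        self.tir_branch = nn.Sequential(conv_block(3, 64), conv_block(64, dim))
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)  # feature-level fusion

    def forward(self, rgb, tir):
        f_rgb = self.rgb_branch(rgb)          # (B, dim, H/4, W/4)
        f_tir = self.tir_branch(tir)          # (B, dim, H/4, W/4)
        return self.fuse(torch.cat([f_rgb, f_tir], dim=1))

if __name__ == "__main__":
    model = PseudoSiameseFusion()
    rgb = torch.randn(1, 3, 256, 256)   # visible-light search region
    tir = torch.randn(1, 3, 256, 256)   # thermal-infrared search region
    print(model(rgb, tir).shape)        # torch.Size([1, 256, 64, 64])
```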

Keywords: multi-modal fusion, visible-light images, infrared images, Transformer, object tracking

Abstract: Current object tracking methods mostly fuse information from different modalities to make localization decisions, and they suffer from insufficient information extraction, overly simple fusion methods, and an inability to accurately track targets in low-light scenes. To this end, a Transformer-based multi-modal object tracking algorithm (Trans-RGBT) is proposed. Firstly, features are extracted from the visible and infrared images separately by a pseudo-Siamese network and fully fused at the feature level. Secondly, the target information from the first frame is modulated into the feature vectors of the frame to be tracked, yielding a tracker dedicated to that target. Then, a Transformer is applied to encode and decode the target in the field of view: the spatial position of the target is predicted by a spatial-position prediction branch, and distractor targets are filtered out using historical information to obtain the accurate position of the target. Finally, the bounding box of the target is predicted by a box regression network, so as to achieve accurate tracking. Extensive experiments are conducted on the latest large-scale datasets VTUAV and RGBT234. Compared with Siamese-based and filter-based algorithms, Trans-RGBT has higher accuracy and better robustness, and runs at a near real-time speed of 22 frames per second.
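The Transformer encoding/decoding and the two prediction branches could be organized roughly as in the following sketch. The element-wise template modulation, layer counts, and head designs here are assumptions made for illustration, not the published implementation.

```python
# Minimal sketch (assumptions, not the published implementation): fused
# search-region features are flattened into tokens, modulated by the
# first-frame template feature, and passed through a Transformer
# encoder-decoder feeding a spatial-position head and a box-regression head.
import torch
import torch.nn as nn

class TransTrackHead(nn.Module):
    def __init__(self, dim=256, heads=8, enc_layers=2, dec_layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=enc_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=dec_layers)
        self.pos_head = nn.Linear(dim, 1)                   # per-token spatial-position score
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 4))    # (cx, cy, w, h) in [0, 1]

    def forward(self, search_feat, template_feat):
        # search_feat: (B, dim, H, W) fused RGB-T features of the frame to track
        # template_feat: (B, dim) pooled first-frame target feature
        b, c, h, w = search_feat.shape
        tokens = search_feat.flatten(2).transpose(1, 2)      # (B, H*W, dim)
        tokens = tokens * template_feat.unsqueeze(1)         # template modulation (assumption)
        memory = self.encoder(tokens)                        # encode the search region
        query = template_feat.unsqueeze(1)                   # (B, 1, dim) target query
        target_tok = self.decoder(query, memory)             # decode the target
        pos_map = self.pos_head(memory).view(b, h, w)        # spatial-position prediction
        box = self.box_head(target_tok.squeeze(1)).sigmoid() # bounding-box regression
        return pos_map, box

if __name__ == "__main__":
    head = TransTrackHead()
    pos_map, box = head(torch.randn(2, 256, 16, 16), torch.randn(2, 256))
    print(pos_map.shape, box.shape)  # torch.Size([2, 16, 16]) torch.Size([2, 4])
```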

Key words: multi-modal fusion, visible images, infrared images, Transformer, object tracking