Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (7): 267-277. DOI: 10.3778/j.issn.1002-8331.2311-0059

• Graphics and Image Processing •


Research on RGB-T Multimodal Interaction Tracking Algorithm with Improved ViT

WU Bo, ZHANG Rongfen, LIU Yuhong   

  1. College of Big Data and Information Engineering, Guizhou University, Guiyang 550025, China
  • Online: 2025-04-01  Published: 2025-04-01


Abstract: In previous RGB-T object tracking work, the features of visible (RGB) and thermal infrared (TIR) images are either directly concatenated, or candidate boxes are extracted from the search images and isolated RGB-TIR candidate pairs are fused. Both strategies fail to fully exploit the complementary information between the two modalities and introduce redundant background noise. To address this, a multimodal interaction RGB-T tracking algorithm based on an improved ViT (vision Transformer) is proposed. Firstly, efficient multi-head self-attention (EMSA) is used to strengthen information interaction among the attention heads. Secondly, an AdaptMLP module is introduced to process data more flexibly and strengthen the nonlinear expression ability of the model. Then, an MHT (MLP-Mixer hybrid Transformer) module is designed: while the attention branch of the Transformer block extracts features, a parallel MLP-Mixer branch extracts feature information from the input along the spatial and channel dimensions, and the features of the two branches are then fused. Finally, the features output by the MHT module are fed into a TBSI (bridging search region interaction with template) layer for cross-modal information interaction. On the test set of the public large-scale RGB-T dataset LasHeR, the proposed tracker achieves a precision of 66.9%, a normalized precision of 63.5% and a success rate of 53.5%, which are 1.8, 1.6 and 1.3 percentage points higher than the baseline algorithm, respectively. Experimental results show that the proposed algorithm improves tracker performance.
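As a rough illustration of the two-branch MHT idea described above (not the authors' implementation), the sketch below runs a single-head self-attention branch and an MLP-Mixer-style branch (token mixing across the spatial axis, then channel mixing) over the same token matrix, and fuses the two branch outputs by concatenation and linear projection. The token count, embedding width, single attention head, random weights, and the concatenate-then-project fusion are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_branch(x, wq, wk, wv):
    # Single-head self-attention over the token matrix x: (N, D).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(x.shape[-1]))
    return scores @ v

def mixer_branch(x, w_tok, w_ch):
    # MLP-Mixer-style branch with residual connections:
    # token mixing blends information across the spatial (token) axis,
    # channel mixing blends information across the feature axis.
    y = x + w_tok @ x          # (N, N) @ (N, D): spatial/token mixing
    return y + y @ w_ch        # (N, D) @ (D, D): channel mixing

def mht_block(x, wq, wk, wv, w_tok, w_ch, w_fuse):
    # Run both branches on the same input, then fuse by
    # concatenating along channels and projecting back to D.
    a = attention_branch(x, wq, wk, wv)
    m = mixer_branch(x, w_tok, w_ch)
    return np.concatenate([a, m], axis=-1) @ w_fuse  # (N, 2D) @ (2D, D)

rng = np.random.default_rng(0)
N, D = 16, 32                  # illustrative token count and embedding width
s = 0.1                        # small init scale to keep activations tame
wq = rng.standard_normal((D, D)) * s
wk = rng.standard_normal((D, D)) * s
wv = rng.standard_normal((D, D)) * s
w_tok = rng.standard_normal((N, N)) * s
w_ch = rng.standard_normal((D, D)) * s
w_fuse = rng.standard_normal((2 * D, D)) * s

x = rng.standard_normal((N, D))
out = mht_block(x, wq, wk, wv, w_tok, w_ch, w_fuse)
print(out.shape)               # (16, 32): same token/channel shape as the input
```

Because the fused output keeps the input's (tokens, channels) shape, blocks of this form can be stacked, and their outputs handed to a downstream cross-modal layer such as the TBSI layer the abstract describes.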

Key words: RGB-T tracking, hybrid multilayer perceptron (MLP-Mixer), multimodal interaction, template fusion, efficient multi-head self-attention (EMSA)