Computer Engineering and Applications, 2024, Vol. 60, Issue (11): 84-94. DOI: 10.3778/j.issn.1002-8331.2302-0035

• Theory, Research and Development •

Multi-modal Object Tracking Algorithm Using Transformer

LIU Wanjun, LIANG Linlin, QU Haicheng

  1. College of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
  • Online: 2024-06-01  Published: 2024-05-31

Trans-RGBT: RGBT Object Tracking with Transformer

LIU Wanjun, LIANG Linlin, QU Haicheng   

  1. College of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
  • Online: 2024-06-01  Published: 2024-05-31

Abstract: Most current object tracking methods make localization decisions by fusing information from different modalities, and they suffer from insufficient information extraction, overly simple fusion schemes, and an inability to track targets accurately in low-light scenes. To address this, a Transformer-based multi-modal object tracking algorithm (Trans-RGBT) is proposed. A pseudo-Siamese network extracts features from the visible-light image and the infrared image separately, and the two are fully fused at the feature level. The target information from the first frame is modulated into the feature vectors of the frame to be tracked, yielding a tracker dedicated to that target. A Transformer is then used to encode and decode the target in the field of view: a spatial-position prediction branch predicts the target's spatial position, and historical information is used to filter out distractors, giving the target's accurate location. Finally, a bounding-box regression network predicts the target's enclosing rectangle, achieving accurate tracking. Experiments on the latest large-scale datasets VTUAV and RGBT234 show that, compared with Siamese-based and filter-based algorithms, Trans-RGBT is more accurate and more robust, and runs at a near real-time speed of 22 FPS.
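As a rough illustration of the dual-branch front end described above, the sketch below shows a pseudo-Siamese two-branch extractor for the visible-light and infrared inputs with feature-level fusion. The backbone depth, channel widths, and the 1x1-convolution fusion are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumptions, not the authors' code): a pseudo-Siamese
# two-branch extractor for RGB and thermal-infrared inputs, fused at the
# feature level. Backbone depth and channel sizes are illustrative only.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # simple conv-BN-ReLU stage standing in for a real backbone stage
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class PseudoSiameseFusion(nn.Module):
    """Two branches with identical structure but separate weights
    (pseudo-Siamese), followed by feature-level fusion."""
    def __init__(self, dim=256):
        super().__init__()
        self.rgb_branch = nn.Sequential(conv_block(3, 64), conv_block(64, dim))
        self.tir_branch = nn.Sequential(conv_block(3, 64), conv_block(64, dim))
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)  # feature-level fusion

    def forward(self, rgb, tir):
        f_rgb = self.rgb_branch(rgb)          # (B, dim, H/4, W/4)
        f_tir = self.tir_branch(tir)          # (B, dim, H/4, W/4)
        return self.fuse(torch.cat([f_rgb, f_tir], dim=1))

if __name__ == "__main__":
    model = PseudoSiameseFusion()
    rgb = torch.randn(1, 3, 256, 256)   # visible-light search region
    tir = torch.randn(1, 3, 256, 256)   # thermal-infrared search region
    print(model(rgb, tir).shape)        # torch.Size([1, 256, 64, 64])
```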

Keywords: multi-modal fusion, visible-light images, infrared images, Transformer, object tracking

Abstract: Current object tracking methods mostly fuse information from different modalities to make localization decisions, and they suffer from insufficient information extraction, overly simple fusion methods, and an inability to accurately track targets in low-light scenes. To this end, a Transformer-based multi-modal object tracking algorithm (Trans-RGBT) is proposed. Firstly, features are extracted from the visible and infrared images separately by a pseudo-Siamese network and fully fused at the feature level. Secondly, the target information from the first frame is modulated into the feature vectors of the frame to be tracked, yielding a tracker dedicated to that target. Then, a Transformer is applied to encode and decode the target in the field of view: the spatial position of the target is predicted by a spatial-position prediction branch, and distractor targets are filtered out using historical information to obtain the accurate position of the target. Finally, the bounding box of the target is predicted by a box regression network, so as to achieve accurate tracking. Extensive experiments are conducted on the latest large-scale datasets VTUAV and RGBT234. Compared with Siamese-based and filter-based algorithms, Trans-RGBT has higher accuracy and better robustness, and runs at a near real-time speed of 22 frames per second.
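The Transformer encoding/decoding and the two prediction branches could be organized roughly as in the following sketch. The element-wise template modulation, layer counts, and head designs here are assumptions made for illustration, not the published implementation.

```python
# Minimal sketch (assumptions, not the published implementation): fused
# search-region features are flattened into tokens, modulated by the
# first-frame template feature, and passed through a Transformer
# encoder-decoder feeding a spatial-position head and a box-regression head.
import torch
import torch.nn as nn

class TransTrackHead(nn.Module):
    def __init__(self, dim=256, heads=8, enc_layers=2, dec_layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=enc_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=dec_layers)
        self.pos_head = nn.Linear(dim, 1)                   # per-token spatial-position score
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 4))    # (cx, cy, w, h) in [0, 1]

    def forward(self, search_feat, template_feat):
        # search_feat: (B, dim, H, W) fused RGB-T features of the frame to track
        # template_feat: (B, dim) pooled first-frame target feature
        b, c, h, w = search_feat.shape
        tokens = search_feat.flatten(2).transpose(1, 2)      # (B, H*W, dim)
        tokens = tokens * template_feat.unsqueeze(1)         # template modulation (assumption)
        memory = self.encoder(tokens)                        # encode the search region
        query = template_feat.unsqueeze(1)                   # (B, 1, dim) target query
        target_tok = self.decoder(query, memory)             # decode the target
        pos_map = self.pos_head(memory).view(b, h, w)        # spatial-position prediction
        box = self.box_head(target_tok.squeeze(1)).sigmoid() # bounding-box regression
        return pos_map, box

if __name__ == "__main__":
    head = TransTrackHead()
    pos_map, box = head(torch.randn(2, 256, 16, 16), torch.randn(2, 256))
    print(pos_map.shape, box.shape)  # torch.Size([2, 16, 16]) torch.Size([2, 4])
```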

Key words: multi-modal fusion, visible images, infrared images, Transformer, object tracking