Computer Engineering and Applications (计算机工程与应用), 2024, Vol. 60, Issue (19): 230-241. DOI: 10.3778/j.issn.1002-8331.2307-0069

• Graphics and Image Processing •

Transformer Single Target Tracking Algorithm Integrating Spatio-Temporal Information

JIANG Jinbao (江进宝), XUAN Shibin (宣士斌), FU Jie (付杰)

  1. College of Artificial Intelligence, Guangxi Minzu University, Nanning 530006, China
  2. Guangxi Key Laboratory of Hybrid Computation and IC Design Analysis, Guangxi Minzu University, Nanning 530006, China
  • Publication date: 2024-10-01  Release date: 2024-09-30

Abstract: Mainstream single-target trackers based on Siamese networks match the target by computing the similarity between the template and the search region, but make little use of the target's spatio-temporal state. In particular, when a scene contains several similar objects, Siamese trackers often fail to tell them apart, which leads to tracking errors. To address this, a Transformer single-target tracking algorithm that fuses spatio-temporal information (SIFTransT) is proposed. First, the algorithm obtains preliminary tracking results from the MixFormer (end-to-end tracking with iterative mixed attention) tracker. Second, a target state calculation module computes and stores the target's state information, including position, bounding box, velocity, acceleration, and direction of motion. Finally, a Transformer-based spatio-temporal information fusion module uses the self-attention of the encoder and the cross-attention of the decoder to fuse the target's state information over a recent time window, so that the target state is modeled more accurately and tracking accuracy improves. Experiments on the LaSOT dataset show that, compared with the baseline MixFormer, SIFTransT improves AUC by 2.8 percentage points, PNorm by 2.6 percentage points, and P by 2.1 percentage points, and runs at an average of 28 frames per second on a server equipped with an RTX8000 GPU.

Key words: single target tracking, target state calculation, attention mechanism, spatio-temporal information fusion
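
The abstract names two computational steps: deriving a per-frame target state (position, bounding box, velocity, acceleration, direction of motion) and fusing recent states with self-attention and cross-attention. The sketch below illustrates both ideas in NumPy under loud assumptions: the function names `target_states` and `attention`, the `(cx, cy, w, h)` box layout, the finite-difference formulas, and the use of raw state vectors without learned projections, normalization, or feed-forward layers are all illustrative choices. None of them is taken from the paper, so this should not be read as the SIFTransT implementation.

```python
import numpy as np


def target_states(boxes, dt=1.0):
    """Build per-frame state vectors from a history of (cx, cy, w, h) boxes.

    Returns shape (T, 9): cx, cy, w, h, vx, vy, ax, ay, heading.
    Velocity and acceleration are backward finite differences; frames
    without enough history keep zeros. Heading is atan2(vy, vx) in radians.
    (Hypothetical layout, chosen for illustration only.)
    """
    boxes = np.asarray(boxes, dtype=float)
    centers = boxes[:, :2]

    vel = np.zeros_like(centers)
    vel[1:] = (centers[1:] - centers[:-1]) / dt      # needs 2 frames

    acc = np.zeros_like(centers)
    acc[2:] = (vel[2:] - vel[1:-1]) / dt             # needs 3 frames

    heading = np.arctan2(vel[:, 1], vel[:, 0])
    return np.concatenate([boxes, vel, acc, heading[:, None]], axis=1)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, key, value):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = query.shape[-1]
    weights = softmax(query @ key.T / np.sqrt(d))
    return weights @ value, weights


# A target moving with constant velocity (2, 1) pixels per frame.
boxes = [(10, 20, 40, 30), (12, 21, 40, 30), (14, 22, 40, 30), (16, 23, 40, 30)]
states = target_states(boxes)

# Encoder-like step: every stored state attends to the whole history.
memory, _ = attention(states, states, states)

# Decoder-like step: the latest state queries the encoded history.
fused, weights = attention(states[-1:], memory, memory)
```

In a trained model the states would normally be embedded and normalized before attention. With raw pixel-scale values, as here, the softmax tends to collapse onto a single frame, so the sketch shows only the data flow and not useful attention weights.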