Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (12): 183-190. DOI: 10.3778/j.issn.1002-8331.2203-0035

• Graphics and Image Processing •

Dual-Stream Object Tracking Algorithm Based on Vision Transformer

JIANG Yingjie, SONG Xiaoning

  1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online: 2022-06-15 Published: 2022-06-15

Abstract: Existing Transformer-based object tracking algorithms mainly use the Transformer to fuse deep convolutional features, overlooking its capacity for feature extraction and decoding prediction. To address this problem, a dual-stream object tracking algorithm based on the vision Transformer is proposed. The attention-based Swin Transformer is introduced for feature extraction, and global information is modeled through shifted windows. A Transformer encoder fully fuses the target features with the search-region features, and a decoder learns the location information in the target query. Target prediction is then performed separately on the two information streams from the encoder and the decoder, and the two predictions are combined by weighted fusion at the decision level to obtain the final tracking result; a multi-supervision strategy is also used. The proposed algorithm achieves state-of-the-art results on four challenging large-scale tracking datasets, LaSOT, TrackingNet, UAV123, and NFS, with success-plot AUC (area under the curve) scores of 67.4%, 80.9%, 68.6%, and 66.0%, respectively, demonstrating its strong potential. Furthermore, because it avoids complex post-processing steps, it tracks end to end at speeds of up to 42 FPS.
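The decision-level weighted fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the fusion formula or the weights, so everything here is an assumption made for illustration: the function name `fuse_boxes`, the `(x1, y1, x2, y2)` box format, and the default weight of 0.5 are hypothetical, and the convex combination shown is one simple way to fuse two predictions, not necessarily the one used in the paper.

```python
from typing import Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def fuse_boxes(encoder_box: Sequence[float],
               decoder_box: Sequence[float],
               encoder_weight: float = 0.5) -> Box:
    """Fuse two box predictions by a coordinate-wise convex combination.

    encoder_weight is an illustrative parameter; the paper's actual
    weighting scheme is not stated in the abstract.
    """
    if not 0.0 <= encoder_weight <= 1.0:
        raise ValueError("encoder_weight must lie in [0, 1]")
    if len(encoder_box) != 4 or len(decoder_box) != 4:
        raise ValueError("each box needs exactly four coordinates")
    decoder_weight = 1.0 - encoder_weight
    return tuple(encoder_weight * e + decoder_weight * d
                 for e, d in zip(encoder_box, decoder_box))


# Example: average the predictions of the two streams.
fused = fuse_boxes((10.0, 20.0, 110.0, 220.0), (14.0, 24.0, 114.0, 224.0))
print(fused)
```

Fusing at the decision level keeps the two prediction heads independent, so each stream can be supervised separately and the combination weight can be tuned without retraining.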

Keywords: object tracking, deep learning, Siamese network, Transformer, attention mechanism
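For readers unfamiliar with the success-plot AUC reported in the abstract, the following sketch shows one common way to compute it: measure the intersection over union (IoU) between predicted and ground-truth boxes in every frame, compute the fraction of frames whose IoU exceeds each of a set of evenly spaced thresholds, and average those fractions. The function names, the `(x1, y1, x2, y2)` box format, and the choice of 21 thresholds are assumptions for illustration; each benchmark's official toolkit defines its own exact protocol.

```python
from typing import List, Sequence


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def success_auc(predictions: List[Sequence[float]],
                ground_truth: List[Sequence[float]],
                num_thresholds: int = 21) -> float:
    """Mean success rate over IoU thresholds evenly spaced in [0, 1]."""
    if not predictions or len(predictions) != len(ground_truth):
        raise ValueError("need equally sized, non-empty box lists")
    overlaps = [iou(p, g) for p, g in zip(predictions, ground_truth)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    # Success rate at a threshold: fraction of frames with IoU above it.
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


# Example: two frames, one perfect prediction and one complete miss.
preds = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
truth = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
print(success_auc(preds, truth))
```

Because the score averages over all thresholds, it rewards trackers that localize tightly in many frames rather than ones that only clear a single fixed IoU cutoff.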