[1] KUGARAJEEVAN J, KOKUL T, RAMANAN A, et al. Transformers in single object tracking: an experimental survey[J]. IEEE Access, 2023, 11: 80297-80326.
[2] 胡硕, 姚美玉, 孙琳娜, 等. 融合注意力特征的精确视觉跟踪[J]. 计算机科学与探索, 2023, 17(4): 868-878.
HU S, YAO M Y, SUN L N, et al. Accurate visual tracking with attention feature[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(4): 868-878.
[3] 孙子文, 钱立志, 杨传栋, 等. 基于Transformer的视觉目标跟踪方法综述[J]. 计算机应用, 2024, 44(5): 1644-1654.
SUN Z W, QIAN L Z, YANG C D, et al. Survey of visual object tracking methods based on Transformer[J]. Journal of Computer Applications, 2024, 44(5): 1644-1654.
[4] JAVED S, DANELLJAN M, KHAN F S, et al. Visual object tracking with discriminative filters and Siamese networks: a survey and outlook[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(5): 6552-6574.
[5] HENRIQUES J F, CASEIRO R, MARTINS P, et al. High-speed tracking with kernelized correlation filters[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(3): 583-596.
[6] DANELLJAN M, HÄGER G, KHAN F S, et al. Convolutional features for correlation filter based visual tracking[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision Workshop. Piscataway: IEEE, 2015: 58-66.
[7] LI F, TIAN C, ZUO W M, et al. Learning spatial-temporal regularized correlation filters for visual tracking[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 4904-4913.
[8] BHAT G, JOHNANDER J, DANELLJAN M, et al. Unveiling the power of deep tracking[C]//Proceedings of the European Conference on Computer Vision, 2018: 483-498.
[9] 梁义涛, 韩永波, 李磊. 深度长时目标跟踪算法综述[J]. 计算机工程与应用, 2023, 59(4): 1-17.
LIANG Y T, HAN Y B, LI L. Survey on deep-learning-based long-term object tracking algorithms[J]. Computer Engineering and Applications, 2023, 59(4): 1-17.
[10] 韩瑞泽, 冯伟, 郭青, 等. 视频单目标跟踪研究进展综述[J]. 计算机学报, 2022, 45(9): 1877-1907.
HAN R Z, FENG W, GUO Q, et al. Single object tracking research: a survey[J]. Chinese Journal of Computers, 2022, 45(9): 1877-1907.
[11] BERTINETTO L, VALMADRE J, HENRIQUES J F, et al. Fully-convolutional Siamese networks for object tracking[C]//Proceedings of the European Conference on Computer Vision, 2016: 850-865.
[12] VALMADRE J, BERTINETTO L, HENRIQUES J, et al. End-to-end representation learning for correlation filter based tracking[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 2805-2813.
[13] WANG Q, GAO J, XING J L, et al. DCFNet: discriminant correlation filters network for visual tracking[J]. arXiv:1704.04057, 2017.
[14] LI B, YAN J J, WU W, et al. High performance visual tracking with Siamese region proposal network[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 8971-8980.
[15] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[16] LI B, WU W, WANG Q, et al. SiamRPN++: evolution of Siamese visual tracking with very deep networks[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 4282-4291.
[17] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
[18] XU Y D, WANG Z Y, LI Z X, et al. SiamFC++: towards robust and accurate visual tracking with target estimation guidelines[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 12549-12556.
[19] GUO D Y, WANG J, CUI Y, et al. SiamCAR: Siamese fully convolutional classification and regression for visual tracking[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 6269-6277.
[20] CHEN Z D, ZHONG B N, LI G R, et al. Siamese box adaptive network for visual tracking[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 6668-6677.
[21] ZHANG Z P, PENG H W, FU J L, et al. Ocean: object-aware anchor-free tracking[C]//Proceedings of the 16th European Conference on Computer Vision. Cham: Springer International Publishing, 2020: 771-787.
[22] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[J]. arXiv:2010.11929, 2020.
[23] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems, 2017.
[24] 彭浩康, 葛芸, 杨小雨, 等. 基于Deformable Transformer和自适应检测头的遥感图像目标检测[J]. 激光与光电子学进展, 2024, 61(12): 325-336.
PENG H K, GE Y, YANG X Y, et al. Target detection in remote sensing image based on Deformable Transformer and adaptive detection head[J]. Laser & Optoelectronics Progress, 2024, 61(12): 325-336.
[25] TOUVRON H, CORD M, DOUZE M, et al. Training data-efficient image transformers & distillation through attention[C]//Proceedings of the International Conference on Machine Learning, 2021: 10347-10357.
[26] LIU Z, LIN Y T, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 10012-10022.
[27] 汪强, 卢先领. 时空模板更新的Transformer目标跟踪算法[J]. 计算机科学与探索, 2023, 17(9): 2161-2173.
WANG Q, LU X L. Transformer object tracking algorithm based on spatio-temporal template update[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(9): 2161-2173.
[28] HAN K, XIAO A, WU E, et al. Transformer in transformer[C]//Advances in Neural Information Processing Systems, 2021: 15908-15919.
[29] CHU X X, TIAN Z, ZHANG B, et al. Conditional positional encodings for vision transformers[J]. arXiv:2102.10882, 2021.
[30] LI Y H, WU C Y, FAN H Q, et al. MViTv2: improved multiscale vision transformers for classification and detection[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 4804-4814.
[31] WANG N, ZHOU W G, WANG J, et al. Transformer meets tracker: exploiting temporal context for robust visual tracking[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 1571-1580.
[32] CAO Z A, FU C H, YE J J, et al. HiFT: hierarchical feature transformer for aerial tracking[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 15457-15466.
[33] YAN B, PENG H W, FU J L, et al. Learning spatio-temporal transformer for visual tracking[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 10448-10457.
[34] MEHTA S, RASTEGARI M. MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer[J]. arXiv:2110.02178, 2021.
[35] SANDLER M, HOWARD A, ZHU M L, et al. MobileNetV2: inverted residuals and linear bottlenecks[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 4510-4520.
[36] FAN H, LIN L T, YANG F, et al. LaSOT: a high-quality benchmark for large-scale single object tracking[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 5374-5383.
[37] HUANG L H, ZHAO X, HUANG K Q. GOT-10k: a large high-diversity benchmark for generic object tracking in the wild[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(5): 1562-1577.
[38] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the European Conference on Computer Vision, 2014: 740-755.
[39] LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization[J]. arXiv:1711.05101, 2017.
[40] WU Y, LIM J, YANG M H. Online object tracking: a benchmark[C]//Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2013: 2411-2418.
[41] GUO D Y, SHAO Y Y, CUI Y, et al. Graph attention tracking[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 9543-9552.
[42] ZHU Z, WANG Q, LI B, et al. Distractor-aware Siamese networks for visual object tracking[C]//Proceedings of the European Conference on Computer Vision, 2018: 101-117.