[1] FAN H, LING H. Siamese cascaded region proposal networks for real-time visual tracking[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 7952-7961.
[2] 刘艺, 李蒙蒙, 郑奇斌, 等. 视频目标跟踪算法综述[J]. 计算机科学与探索, 2022, 16(7): 1504-1515.
LIU Y, LI M M, ZHENG Q B, et al. Survey on video object tracking algorithms[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(7): 1504-1515.
[3] 蒋凌云, 杨金龙. 检测优化的标签多伯努利视频多目标跟踪算法[J]. 计算机科学与探索, 2023, 17(6): 1343-1358.
JIANG L Y, YANG J L. Detection optimized labeled multi-Bernoulli algorithm for visual multi-target tracking[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(6): 1343-1358.
[4] BOLME D S, BEVERIDGE J R, DRAPER B A, et al. Visual object tracking using adaptive correlation filters[C]//Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, CA: IEEE, 2010: 2544-2550.
[5] MA H, LIN Z, ACTON S T. FAST: fast and accurate scale estimation for tracking[J]. IEEE Signal Processing Letters, 2020, 27: 161-165.
[6] ZHANG L, VARADARAJAN J, SUGANTHAN P N, et al. Robust visual tracking using oblique random forests[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, Jul 21-26, 2017. Piscataway: IEEE, 2017: 5825-5834.
[7] BERTINETTO L, VALMADRE J, HENRIQUES J F, et al. Fully-convolutional Siamese networks for object tracking[C]//Proceedings of the 2016 European Conference on Computer Vision. Cham: Springer, 2016: 850-865.
[8] GUO Q, FENG W, ZHOU C, et al. Learning dynamic Siamese network for visual object tracking[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 1781-1789.
[9] LI B, YAN J, WU W, et al. High performance visual tracking with Siamese region proposal network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018: 8971-8980.
[10] ZHANG Z P, PENG H W. Deeper and wider Siamese networks for real-time visual tracking[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, 2019: 4586-4595.
[11] GUO D Y, WANG J, CUI Y, et al. SiamCAR: Siamese fully convolutional classification and regression for visual tracking[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, Jun 13-19, 2020. Piscataway: IEEE, 2020: 6268-6276.
[12] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[J]. arXiv:2010.11929, 2020.
[13] SUN C, SHRIVASTAVA A, SINGH S, et al. Revisiting unreasonable effectiveness of data in deep learning era[C]//Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017: 843-852.
[14] DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, 2009: 248-255.
[15] TAN M, LE Q V. EfficientNet: rethinking model scaling for convolutional neural networks[C]//Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 2019: 6105-6114.
[16] TOUVRON H, CORD M, DOUZE M, et al. Training data-efficient image transformers & distillation through attention[C]//Proceedings of the International Conference on Machine Learning, 2021: 10347-10357.
[17] WANG W, XIE E, LI X, et al. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions[C]//Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 2021: 568-578.
[18] WU H, XIAO B, CODELLA N, et al. CvT: introducing convolutions to vision transformers[C]//Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 2021: 22-31.
[19] YUAN L, CHEN Y, WANG T, et al. Tokens-to-token ViT: training vision transformers from scratch on ImageNet[C]//Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 2021: 558-567.
[20] RAO Y, ZHAO W, LIU B, et al. DynamicViT: efficient vision transformers with dynamic token sparsification[C]//Advances in Neural Information Processing Systems, 2021, 34: 13937-13949.
[21] LIU Z, LIN Y, CAO Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 2021: 10012-10022.
[22] HEO B, YUN S, HAN D, et al. Rethinking spatial dimensions of vision transformers[C]//Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 2021: 11936-11945.
[23] 潘昊, 刘翔, 赵静文, 等. 联合Transformer与BYTE数据关联的多目标实时跟踪算法[J]. 激光与光电子学进展, 2023, 60(6): 154-161.
PAN H, LIU X, ZHAO J W, et al. Multitarget real-time tracking algorithm based on Transformer and BYTE data association[J]. Laser & Optoelectronics Progress, 2023, 60(6): 154-161.
[24] HU H, ZHANG Z, XIE Z, et al. Local relation networks for image recognition[C]//Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 2019: 3464-3473.
[25] WANG H, ZHU Y, GREEN B, et al. Axial-deeplab: stand-alone axial-attention for panoptic segmentation[C]//Proceedings of the 2020 European Conference on Computer Vision, Glasgow, UK, 2020: 108-126.
[26] HUANG L, YUAN Y, GUO J, et al. Interlaced sparse self-attention for semantic segmentation[J]. arXiv:1907.12273, 2019.
[27] VASWANI A, RAMACHANDRAN P, SRINIVAS A, et al. Scaling local self-attention for parameter efficient visual backbones[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 2021: 12894-12904.
[28] CHU X, TIAN Z, WANG Y, et al. Twins: revisiting the design of spatial attention in vision transformers[C]//Advances in Neural Information Processing Systems, 2021, 34: 9355-9366.
[29] LI B, WU W, WANG Q, et al. SiamRPN++: evolution of Siamese visual tracking with very deep networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019: 4282-4291.
[30] BA J L, KIROS J R, HINTON G E. Layer normalization[J]. arXiv:1607.06450, 2016.
[31] REZATOFIGHI H, TSOI N, GWAK J Y, et al. Generalized intersection over union: a metric and a loss for bounding box regression[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019: 658-666.
[32] HUANG L, ZHAO X, HUANG K. GOT-10k: a large high-diversity benchmark for generic object tracking in the wild[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(5): 1562-1577.
[33] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the 2014 European Conference on Computer Vision, Zurich, Switzerland, 2014: 740-755.
[34] RUSSAKOVSKY O, DENG J, SU H, et al. ImageNet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3): 211-252.
[35] WU Y, LIM J, YANG M H. Object tracking benchmark[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1834-1848.
[36] GLOROT X, BENGIO Y. Understanding the difficulty of training deep feedforward neural networks[C]//Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 2010: 249-256.
[37] LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization[J]. arXiv:1711.05101, 2017.