Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (13): 280-290.DOI: 10.3778/j.issn.1002-8331.2403-0319

• Graphics and Image Processing •

Visual Transformer Tracking Algorithm Integrating Attention and Residual Connection

TIAN Panshuai, GE Haibo, AN Yu, XUE Zihan   

  1. School of Electronic Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
  • Online: 2025-07-01  Published: 2025-06-30

Abstract: To address the loss of low-level features and the imbalance between speed and accuracy caused by deepening network models in target tracking, a ViT tracking algorithm, MRATrans (multi-head residual attention Transformer), that integrates attention and residual connections is proposed. MobileViT is used as the backbone network, achieving rich feature representation while keeping model complexity low. Firstly, a residual attention module (RAM) is proposed to effectively prevent the loss of low-level features, and a multi-head residual attention (MHRA) module is designed to attend simultaneously to information represented in different subspaces, further improving the expressive ability of the model. In addition, a dense pixel correlation (DPC) module is constructed to compute the similarity between the template and the search region, avoiding spatial distortion and yielding a response map with richer semantic information. Finally, accurate tracking is achieved through a classification-regression network. Extensive experiments on the OTB100, VOT2018, and GOT-10k datasets demonstrate that MRATrans outperforms mainstream algorithms and runs at 87 frames/s, achieving accurate tracking while maintaining high efficiency.
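The abstract does not give the exact formulations of the MHRA and DPC modules; the sketch below illustrates the two generic building blocks they are based on, under assumptions. `multi_head_residual_attention` shows multi-head self-attention with a skip connection (the residual path is what preserves low-level features), and `dense_pixel_correlation` shows pixel-wise cross-correlation, where each template location's feature vector acts as a 1x1 kernel over the search features. All weight matrices and function names here are hypothetical placeholders, not the paper's actual parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_residual_attention(x, num_heads=4, rng=None):
    """Multi-head self-attention followed by a residual (skip) connection.

    x: (seq_len, dim) feature tokens. The projection weights are random
    placeholders; a trained model would learn them.
    """
    rng = rng or np.random.default_rng(0)
    seq_len, dim = x.shape
    head_dim = dim // num_heads  # dim must be divisible by num_heads
    # Hypothetical learned query/key/value/output projections.
    wq, wk, wv, wo = (rng.standard_normal((dim, dim)) * dim ** -0.5
                      for _ in range(4))
    q, k, v = x @ wq, x @ wk, x @ wv
    # Split into heads: (num_heads, seq_len, head_dim)
    split = lambda t: t.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    # Scaled dot-product attention per head, each head attending to
    # a different subspace of the representation.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))
    out = (attn @ v).transpose(1, 0, 2).reshape(seq_len, dim) @ wo
    return x + out  # residual connection keeps low-level features in the stream

def dense_pixel_correlation(template, search):
    """Pixel-wise cross-correlation between template and search features.

    template: (C, Ht, Wt), search: (C, Hs, Ws).
    Returns a (Ht*Wt, Hs, Ws) response map: one similarity channel per
    template pixel, with no channel-wise mixing that could distort space.
    """
    C, Ht, Wt = template.shape
    _, Hs, Ws = search.shape
    t = template.reshape(C, Ht * Wt).T   # (Ht*Wt, C): each row is a 1x1 kernel
    s = search.reshape(C, Hs * Ws)       # (C, Hs*Ws)
    return (t @ s).reshape(Ht * Wt, Hs, Ws)
```

In a Siamese tracker these would sit after the backbone: attention refines the backbone features, and the correlation output feeds the classification-regression head that localizes the target.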

Key words: target tracking, feature loss, Siamese network, residual attention, pixel correlation
