Transformer Single Target Tracking Algorithm Integrating Spatio-Temporal Information

doi:10.3778/j.issn.1002-8331.2307-0069

Abstract

Abstract: Mainstream single-target tracking methods based on Siamese networks match the target by computing the similarity between the template and the search region, but they make little use of the target's spatio-temporal state information. In particular, when a scene contains multiple similar targets, Siamese trackers often fail to distinguish between them, which leads to tracking errors. To address these problems, a Transformer-based single-target tracking algorithm that fuses spatio-temporal information (SIFTransT) is proposed. First, the algorithm obtains preliminary tracking results from the MixFormer (end-to-end tracking with iterative mixed attention) tracker. Second, a target state calculation module is designed to compute and store the target's state information, including position, bounding box, velocity, acceleration, and direction of motion, so that this information can be exploited in depth. Finally, a Transformer-based spatio-temporal information fusion module is constructed, which uses the encoder's self-attention and the decoder's cross-attention to fuse the target's state information over the most recent period, thereby modeling the target state more accurately and improving tracking accuracy. Experimental results on the LaSOT dataset show that, compared with the baseline MixFormer, SIFTransT improves AUC by 2.8 percentage points, PNorm by 2.6 percentage points, and P by 2.1 percentage points, and that it runs at an average of 28 frames per second on a server equipped with an RTX8000 graphics card.

Keywords: single-target tracking, target state calculation, attention mechanism, spatio-temporal information fusion
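
The abstract lists the quantities kept by the target state calculation module (position, bounding box, velocity, acceleration, direction of motion) but does not give their formulas. The following is a minimal sketch of how such a module might be organized, not the authors' implementation. It assumes boxes in `(x, y, w, h)` pixel format, frame-to-frame finite differences for velocity and acceleration, and a fixed-length history; the names `TargetState` and `TargetStateBuffer` and the default history length are hypothetical.

```python
import math
from collections import deque
from dataclasses import dataclass


@dataclass
class TargetState:
    center: tuple        # (cx, cy) box center in pixels
    box: tuple           # (x, y, w, h), an assumed box format
    velocity: tuple      # (vx, vy) in pixels per frame
    acceleration: tuple  # (ax, ay) in pixels per frame squared
    direction: float     # heading in radians, atan2(vy, vx)


class TargetStateBuffer:
    """Hypothetical store for the most recent target states."""

    def __init__(self, maxlen=8):
        # Only the latest `maxlen` states are kept; older ones are dropped.
        self.states = deque(maxlen=maxlen)

    def update(self, box):
        """Compute the state for a new tracked box and append it."""
        x, y, w, h = box
        center = (x + w / 2.0, y + h / 2.0)
        if self.states:
            prev = self.states[-1]
            # Assumption: first-order finite differences between frames.
            velocity = (center[0] - prev.center[0],
                        center[1] - prev.center[1])
            acceleration = (velocity[0] - prev.velocity[0],
                            velocity[1] - prev.velocity[1])
        else:
            # No history yet, so motion terms start at zero.
            velocity = (0.0, 0.0)
            acceleration = (0.0, 0.0)
        direction = math.atan2(velocity[1], velocity[0])
        state = TargetState(center, tuple(box), velocity,
                            acceleration, direction)
        self.states.append(state)
        return state


buf = TargetStateBuffer(maxlen=3)
for tracked_box in [(0, 0, 10, 10), (3, 4, 10, 10), (9, 12, 10, 10)]:
    latest = buf.update(tracked_box)
print(latest.velocity, latest.acceleration)
```

In a pipeline like the one the abstract describes, the stored sequence of states would be the input that a fusion module attends over; how SIFTransT encodes that sequence is not specified here.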

JIANG Jinbao, XUAN Shibin, FU Jie. Transformer Single Target Tracking Algorithm Integrating Spatio-Temporal Information[J]. Computer Engineering and Applications, 2024, 60(19): 230-241.
