Dual-Stream Object Tracking Algorithm Based on Vision Transformer

doi:10.3778/j.issn.1002-8331.2203-0035

Abstract

Abstract: Transformer-based object tracking algorithms mainly use the Transformer to fuse deep convolutional features, overlooking its capacity for feature extraction and decoding prediction. To address this limitation, a dual-stream object tracking algorithm based on the vision Transformer is proposed. The attention-based Swin Transformer is introduced for feature extraction, and global information is modeled through shifted windows. A Transformer encoder fully fuses the target features with the search-region features, and a decoder learns the location information in the target query. Target prediction is then performed separately on the two streams of information from the encoder and the decoder. The two predictions are further combined by weighted fusion at the decision level to obtain the final tracking result, and a multi-supervision strategy is adopted. The proposed algorithm achieves state-of-the-art results on four challenging large-scale tracking datasets, LaSOT, TrackingNet, UAV123, and NFS, with success-rate area under the curve (AUC) scores of 67.4%, 80.9%, 68.6%, and 66.0%, respectively, demonstrating its strong potential. Furthermore, because complex post-processing steps are avoided, tracking runs end to end at 42 FPS.
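
The decision-level weighted fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the weighting scheme, so everything below is an assumption made for illustration: boxes are taken to be (x1, y1, x2, y2) tuples, the fusion is a simple convex combination, and the names `fuse_boxes` and `w_enc` are hypothetical rather than taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2). The paper's actual format is not given here.
Box = Tuple[float, float, float, float]


def fuse_boxes(enc_box: Sequence[float],
               dec_box: Sequence[float],
               w_enc: float = 0.5) -> Box:
    """Blend the encoder-stream and decoder-stream box predictions.

    w_enc is a hypothetical fusion weight in [0, 1]; the decoder stream
    receives the complementary weight 1 - w_enc.
    """
    if not 0.0 <= w_enc <= 1.0:
        raise ValueError("w_enc must lie in [0, 1]")
    if len(enc_box) != 4 or len(dec_box) != 4:
        raise ValueError("each box needs four coordinates")
    w_dec = 1.0 - w_enc
    # Coordinate-wise convex combination of the two predictions.
    return tuple(w_enc * e + w_dec * d for e, d in zip(enc_box, dec_box))


if __name__ == "__main__":
    # Two slightly different predictions for the same target.
    print(fuse_boxes((0.0, 0.0, 10.0, 10.0), (2.0, 2.0, 12.0, 12.0), w_enc=0.5))
```

With equal weights the fused box is the coordinate-wise average of the two streams; setting `w_enc` to 1.0 or 0.0 falls back to a single stream, which is a convenient way to compare each stream on its own against the fused result.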

Keywords: object tracking, deep learning, Siamese network, Transformer, attention mechanism

JIANG Yingjie, SONG Xiaoning. Dual-Stream Object Tracking Algorithm Based on Vision Transformer[J]. Computer Engineering and Applications, 2022, 58(12): 183-190.
