UAV Visual Tracking with Lightweight Transformer

doi:10.3778/j.issn.1002-8331.2208-0312

Abstract

Abstract: With UAVs widely used in military and civilian fields, demand for high-precision, low-power intelligent UAV tracking systems is growing. To address the difficulty that object tracking algorithms have in balancing tracking accuracy and tracking speed in UAV scenarios, a Siamese-network UAV object tracking algorithm that incorporates a lightweight Transformer, named SiamLT, is proposed. The AlexNet backbone is improved with a Transformer so that global feature information is captured at minimal additional computational cost. For matching the target template against the search region, a binary correlation module is proposed that combines a Transformer with the depth-wise cross-correlation operation, capturing both local correlations and global dependencies between the target template and the search region. Distance-IoU (distance intersection over union) is introduced into the classification and regression network, and a multi-supervision strategy is used to train the network to obtain more accurate target locations. Experimental results on the UAV123 and UAV20L tracking benchmarks show that SiamLT outperforms mainstream object tracking algorithms and balances tracking accuracy and tracking speed more effectively.

Keywords: unmanned aerial vehicle (UAV), object tracking, Transformer, Siamese network, multi-head attention
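
The abstract mentions introducing the distance intersection over union into the classification and regression network but does not give the formula. As a minimal sketch, assuming the standard Distance-IoU definition (Zheng et al., 2020), DIoU equals IoU minus the squared distance between box centers divided by the squared diagonal of the smallest enclosing box. The function names `diou` and `diou_loss` and the `(x1, y1, x2, y2)` box layout are illustrative assumptions, not taken from SiamLT.

```python
def diou(box_a, box_b):
    """Distance-IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumed box layout; SiamLT's actual representation is not stated in the abstract.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union area and plain IoU.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centers.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
       + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both inputs.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
       + (max(ay2, by2) - min(ay1, by1)) ** 2

    return iou - d2 / c2 if c2 > 0 else iou


def diou_loss(pred, target):
    """Regression loss in the usual form 1 - DIoU (range 0 to 2)."""
    return 1.0 - diou(pred, target)
```

Unlike plain IoU, the center-distance term still varies when the predicted and ground-truth boxes do not overlap, which is the usual motivation for using it in bounding-box regression.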

SHEN Haiyun, WANG Haichuan, HUANG Zhongyi, YU Honghao. UAV Visual Tracking with Lightweight Transformer[J]. Computer Engineering and Applications, 2024, 60(2): 244-253.
