Computer Engineering and Applications, 2024, Vol. 60, Issue (2): 180-190. DOI: 10.3778/j.issn.1002-8331.2211-0028

• Graphics and Image Processing •

Multi-Object Tracking Algorithm Based on CNN-Transformer Feature Fusion

ZHANG Yingjun, BAI Xiaohui, XIE Binhong   

  1. College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China
  • Online: 2024-01-15 Published: 2024-01-15

CNN-Transformer特征融合多目标跟踪算法

张英俊,白小辉,谢斌红   

  1. 太原科技大学 计算机科学与技术学院,太原 030024

Abstract: In convolutional neural networks (CNNs), convolution efficiently extracts local features of an object but struggles to capture global representations; in the vision Transformer, the attention mechanism captures long-range feature dependencies but tends to overlook local feature details. To address these problems, CTMOT (CNN-Transformer multi-object tracking), a multi-object tracking algorithm that uses a hybrid CNN-Transformer network for feature extraction and fusion, is proposed. First, a dual-branch backbone network, with a CNN branch and a Transformer branch running in parallel, extracts the local and global features of the image, respectively. Second, a two-way bridge module (TBM) fully fuses the two kinds of features. Then, the fused features are fed into two parallel decoders. Finally, the detection boxes and tracking boxes output by the decoders are matched to obtain the final tracking result and complete the multi-object tracking task. Evaluated on the MOT17, MOT20, KITTI and UA-DETRAC multi-object tracking datasets, CTMOT reaches MOTA scores of 76.4%, 66.3%, 92.36% and 88.57%, respectively, which is comparable to state-of-the-art (SOTA) methods on the MOT datasets and reaches SOTA performance on the KITTI dataset. Its MOTP and IDs metrics reach SOTA performance on all four datasets. In addition, because object detection and association are performed simultaneously, tracking can be carried out end to end at up to 35 FPS, which shows that CTMOT strikes a good balance between real-time performance and tracking accuracy and has considerable potential.

Key words: multi-object tracking, Transformer, feature fusion

摘要: 在卷积神经网络(CNN)中,卷积运算能高效地提取目标的局部特征,却难以捕获全局表示;而在视觉Transformer中,注意力机制可以捕获长距离的特征依赖,但会忽略局部特征细节。针对以上问题,提出一种基于CNN-Transformer双分支主干网络进行特征提取和融合的多目标跟踪算法CTMOT(CNN-Transformer multi-object tracking)。使用基于CNN和Transformer双分支并行的主干网络分别提取图像的局部和全局特征。使用双向桥接模块(two-way bridge module,TBM)对两种特征进行充分融合。将融合后的特征输入两组并行的解码器进行处理。将解码器输出的检测框和跟踪框进行匹配,完成多目标跟踪任务。在多目标跟踪数据集MOT17、MOT20、KITTI以及UA-DETRAC上进行评估,CTMOT算法的MOTP和IDs指标在四个数据集上均达到了SOTA效果,MOTA指标分别达到了76.4%、66.3%、92.36%和88.57%,在MOT数据集上与SOTA方法效果相当,在KITTI数据集上达到SOTA效果。由于同时完成目标检测和关联,能够端到端进行目标跟踪,跟踪速度可达35 FPS,表明CTMOT算法在跟踪的实时性和准确性上达到了较好的平衡,具有较大潜力。

关键词: 多目标跟踪, Transformer, 特征融合
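
Illustrative sketch: the abstract states that the detection boxes and tracking boxes output by the two decoders are matched to produce the final tracks, but it does not specify the matching criterion. The Python sketch below assumes an intersection-over-union (IoU) score with a greedy one-to-one assignment; the names `iou`, `match_boxes`, and the 0.5 threshold are hypothetical choices for illustration only and should not be read as the authors' implementation.

```python
from typing import Dict, List, Tuple

# A box is (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (assumed convention).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_boxes(
    track_boxes: Dict[int, Box],
    det_boxes: List[Box],
    iou_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> Tuple[Dict[int, int], List[int]]:
    """Greedily pair detections with existing tracks by descending IoU.

    Returns (matches, unmatched), where matches maps a detection index to
    a track id and unmatched lists detection indices that would start
    new tracks.
    """
    # Score every (track, detection) pair, best overlap first.
    pairs = sorted(
        (
            (iou(t_box, d_box), track_id, det_idx)
            for track_id, t_box in track_boxes.items()
            for det_idx, d_box in enumerate(det_boxes)
        ),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used_tracks = set()
    for score, track_id, det_idx in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap even less
        if track_id in used_tracks or det_idx in matches:
            continue  # keep the assignment one-to-one
        matches[det_idx] = track_id
        used_tracks.add(track_id)
    unmatched = [i for i in range(len(det_boxes)) if i not in matches]
    return matches, unmatched


# Tracking boxes predicted for two existing identities, plus three detections.
tracks = {1: (0.0, 0.0, 10.0, 10.0), 2: (20.0, 20.0, 30.0, 30.0)}
detections = [(21.0, 21.0, 31.0, 31.0), (1.0, 1.0, 11.0, 11.0), (50.0, 50.0, 60.0, 60.0)]
matches, unmatched = match_boxes(tracks, detections)
```

In this toy example the first two detections inherit the identities of the overlapping tracks and the third is left unmatched, so it would start a new track. A greedy pass is used only to keep the sketch dependency-free; an optimal assignment (for example the Hungarian algorithm) could be substituted without changing the interface.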