Computer Engineering and Applications, 2023, Vol. 59, Issue (20): 192-199. DOI: 10.3778/j.issn.1002-8331.2306-0023

• Graphics and Image Processing •

Fusion of Sparse Attention and Time Query for Video Object Detection

MEI Siyi, LIU Yanlong   

  1. College of Information and Computer, Taiyuan University of Technology, Jinzhong, Shanxi 030600, China
  • Online: 2023-10-15 Published: 2023-10-15




Abstract: In video object detection, accuracy is degraded by several factors: changes in an object's appearance over time, video jitter, and single-frame blur caused by defocus, ghosting, and similar effects. To improve detection accuracy in video and to reduce blurring at detected object edges, an improved end-to-end video object detection network is proposed. First, a sparse attention mechanism concentrates attention on the object foreground, which reduces attention dispersion and background interference and improves edge-detection accuracy. Second, a temporal fusion query module uses shallow encoders, which retain more information, to link the temporal queries of the reference frames, fusing features across temporal contexts and enhancing the features of the target frame. In addition, reference frames are selected sparsely at both near and far temporal distances, which compensates for motion blur of the target while reducing feature redundancy. The model is evaluated on two datasets, ImageNet VID and UA-DETRAC, reaching accuracies of 92.3% and 90.9%, respectively. The experimental results show that the proposed model performs better on video object detection and improves overall performance relative to other advanced networks.
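
The abstract does not state how the sparse attention is computed. As a minimal sketch of the general idea, and not the authors' implementation, the following assumes one common form, top-k sparse attention, in which a query attends only to its k highest-scoring keys and every other position gets zero weight. The function name topk_sparse_attention and the parameter k are hypothetical and do not come from the paper.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Attend from one query to only its k highest-scoring keys.

    Illustrative assumption: top-k masking stands in for the paper's
    sparse attention, whose exact form the abstract does not give.

    query: list of d floats; keys: list of n lists of d floats;
    values: list of n lists of floats; k: number of keys kept.
    Returns (output, weights); weights of discarded keys are 0.0.
    """
    d = len(query)
    n = len(keys)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(keys)")

    # Scaled dot-product score between the query and every key.
    scores = [
        sum(q * x for q, x in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]

    # Indices of the k largest scores; all other positions are dropped.
    kept = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]

    # Numerically stable softmax restricted to the kept positions.
    peak = max(scores[i] for i in kept)
    exps = {i: math.exp(scores[i] - peak) for i in kept}
    total = sum(exps.values())
    weights = [exps[i] / total if i in exps else 0.0 for i in range(n)]

    # Weighted sum of the value vectors over the kept positions only.
    dim = len(values[0])
    output = [sum(weights[i] * values[i][j] for i in kept) for j in range(dim)]
    return output, weights
```

With small k, low-scoring positions (for example, background regions) contribute nothing to the output, which is the intuition behind the reduced background interference described above. Setting k equal to the number of keys recovers ordinary dense attention.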

Key words: object detection, video object detection, sparse attention mechanism, object query
