Fusion of Sparse Attention and Time Query for Video Object Detection

doi:10.3778/j.issn.1002-8331.2306-0023

Abstract

In video object detection, accuracy is affected by several factors, including changes in the object's appearance over time, video jitter, and blurring or ghosting of individual frames caused by defocus. To improve detection accuracy on video and reduce blurred detection at object edges, an improved end-to-end video object detection network is proposed. First, a sparse attention mechanism is introduced to focus attention on the object foreground, which reduces attention dispersion and background interference and improves edge detection accuracy. Second, a time fusion query module is introduced; it uses shallow encoders, which carry more information, to link the time queries of the reference frames, fusing features across temporal contexts and enhancing the features of the target frame. In addition, reference frames are selected sparsely at both near and far temporal distances to compensate for motion blur of the target while reducing feature redundancy. The model is evaluated on the ImageNet VID and UA-DETRAC datasets, where it reaches accuracies of 92.3% and 90.9%, respectively. The experimental results show that the proposed model performs better on video object detection and improves on the overall performance of other advanced networks.

Keywords: object detection, video object detection, sparse attention mechanism, object query
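
The abstract names a sparse attention mechanism but does not say which variant is used. As a rough illustration only, the sketch below assumes one common form, top-k selection, in which each query keeps its k highest-scoring keys and ignores the rest. The function name `topk_sparse_attention` and its parameters are hypothetical and are not taken from the paper.

```python
import math


def topk_sparse_attention(queries, keys, values, k):
    """Scaled dot-product attention that keeps only the k largest scores per query.

    queries: list of d-dimensional vectors; keys and values: lists of equal length.
    The top-k rule is an assumed form of sparsity, not the paper's stated method.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d) for key in keys]
        # Indices of the k highest-scoring keys; every other key is masked out.
        keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        # Softmax restricted to the kept keys (subtract the max for stability).
        top = max(scores[i] for i in keep)
        exp = {i: math.exp(scores[i] - top) for i in keep}
        total = sum(exp.values())
        # Weighted sum of the kept value vectors.
        out = [0.0] * len(values[0])
        for i in keep:
            weight = exp[i] / total
            for j, v in enumerate(values[i]):
                out[j] += weight * v
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    ks = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    vs = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
    print(topk_sparse_attention(q, ks, vs, k=1))  # attends to the best-matching key only
    print(topk_sparse_attention(q, ks, vs, k=3))  # k equal to the key count gives dense attention
```

With `k=1` the query attends only to its best-matching key, so weakly related positions such as background contribute nothing to the output. Setting `k` to the number of keys recovers ordinary dense attention.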

MEI Siyi, LIU Yanlong. Fusion of Sparse Attention and Time Query for Video Object Detection[J]. Computer Engineering and Applications, 2023, 59(20): 192-199.
