YOLOv4 Optimization Algorithm with Non-Local Sparse Attention

doi:10.3778/j.issn.1002-8331.2207-0047

Abstract

Abstract: Traditional object detection networks such as Fast R-CNN and ResNet lose much of the spatial location information while downsampling to extract image features, and therefore detect small objects poorly. To preserve spatial location information, a cascaded residual high-resolution network with non-local sparse attention, CrHRnet, is proposed. The architecture starts from a high-resolution sub-network and gradually adds high-to-low-resolution sub-networks to form more stages, connecting the sub-networks of different resolutions in parallel. A cascaded residual module (CrModule) is used for feature extraction between same-resolution feature streams. Multi-scale feature map fusion lets each high-to-low-resolution representation repeatedly receive information from the other parallel representations, yielding high-resolution representations rich in both semantic and spatial location information. The non-local sparse attention (NLSA) algorithm is introduced to perform super-resolution reconstruction of deep-network feature blocks, exploiting the structural correlation between instances of the same object at different scales so that the features of small objects become similar to those of large objects and easier to learn. Extensive evaluation on the VOC2007 dataset shows that using CrHRnet as the backbone feature extraction network of YOLOv4 effectively improves detection accuracy: the test mAP (mean average precision) of CrHRnet-YOLOv4 is 1.8, 9.5, and 3.4 percentage points higher than that of YOLOv4, YOLOv5_s, and YOLOv5_m, respectively. On the same hardware, the frames per second (FPS) for single-image detection is 30% higher than that of the YOLOv4 network.

Key words: object detection, high-resolution representation, cascaded residual, non-local sparse attention (NLSA)
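The multi-scale fusion step summarized in the abstract, where parallel streams of different resolutions exchange information, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`upsample2x`, `downsample2x`, `fuse`) are hypothetical, and nearest-neighbour upsampling and 2x2 average pooling are assumed here as simple stand-ins for the learned convolutions a real network would use. It shows only how a high-resolution and a low-resolution feature map can each receive information from the other.

```python
# Toy two-stream fusion on plain 2D grids (lists of rows).
# Assumption: learned convolutions are replaced by fixed resampling.

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a 2D grid."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out


def downsample2x(x):
    """2x2 average pooling of a 2D grid with even height and width."""
    return [
        [(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
         for j in range(0, len(x[0]), 2)]
        for i in range(0, len(x), 2)
    ]


def add(a, b):
    """Element-wise sum of two grids of the same shape."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse(high, low):
    """Each stream keeps its resolution and adds the resampled other stream."""
    new_high = add(high, upsample2x(low))
    new_low = add(low, downsample2x(high))
    return new_high, new_low


high = [[float(4 * i + j) for j in range(4)] for i in range(4)]  # 4x4 stream
low = [[1.0, 2.0], [3.0, 4.0]]                                   # 2x2 stream
fused_high, fused_low = fuse(high, low)
```

After fusion the high-resolution stream is still 4x4 and the low-resolution stream is still 2x2, which is the property that lets the high-resolution representation be kept throughout the network while still absorbing coarser context.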

MIN Feng, MAO Yixin, HOU Zeming, YANG Chaoyuan, WANG Mingmao. YOLOv4 Optimization Algorithm with Non-Local Sparse Concern[J]. Computer Engineering and Applications, 2023, 59(21): 123-131.
