Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (21): 123-131. DOI: 10.3778/j.issn.1002-8331.2207-0047

• Pattern Recognition and Artificial Intelligence •

YOLOv4 Optimization Algorithm with Non-Local Sparse Concern

MIN Feng, MAO Yixin, HOU Zeming, YANG Chaoyuan, WANG Mingmao   

  1. Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan 430205, China
  • Online: 2023-11-01 Published: 2023-11-01

Abstract: Traditional object detection networks such as Fast R-CNN and ResNet lose much of the spatial location information while downsampling to extract image features, and therefore detect small objects poorly. To preserve spatial location information, a cascaded residual high-resolution network (CrHRnet) with non-local sparse attention is proposed. The architecture starts from a high-resolution sub-network and gradually adds high-to-low-resolution sub-networks to form more stages, with the sub-networks of different resolutions connected in parallel. A cascaded residual module (CrModule) extracts features within each same-resolution feature stream. Multi-scale feature-map fusion lets each high-to-low-resolution representation repeatedly receive information from the other parallel representations, producing high-resolution representations rich in both semantic and spatial location information. The non-local sparse attention (NLSA) algorithm is introduced to perform super-resolution reconstruction of deep-network feature blocks, mining the structural correlations between instances of the same object at different scales so that the features of small objects become more like those of large objects and are easier to learn. Extensive evaluation on the VOC2007 dataset shows that using CrHRnet as the backbone feature extraction network of YOLOv4 effectively improves detection accuracy: the test mAP (mean average precision) of CrHRnet-YOLOv4 is 1.8, 9.5, and 3.4 percentage points higher than that of YOLOv4, YOLOv5_s, and YOLOv5_m, respectively. On the same hardware, the frames per second (FPS) for single-image detection is 30% higher than that of the YOLOv4 network.

Key words: object detection, high-resolution representation, cascaded residual, non-local sparse attention (NLSA)
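
The abstract names non-local sparse attention but does not spell out how attention is made sparse. The sketch below illustrates the general idea under stated assumptions: feature vectors are grouped by a locality-sensitive hash, and each feature attends only to the others in its bucket rather than to every position. The random-hyperplane hash, the function names `lsh_bucket` and `sparse_attention`, and the toy features are illustrative choices, not the authors' implementation or the exact NLSA formulation.

```python
import math
import random


def lsh_bucket(vec, hyperplanes):
    """Hash a feature vector to a bucket id from the signs of its random projections."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vec, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def sparse_attention(features, n_planes=2, seed=0):
    """Softmax attention restricted to features sharing a bucket (illustrative only)."""
    rng = random.Random(seed)
    dim = len(features[0])
    # Assumption: random Gaussian hyperplanes stand in for the hashing scheme.
    hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

    # Group feature indices by bucket so similar directions land together.
    buckets = {}
    for idx, vec in enumerate(features):
        buckets.setdefault(lsh_bucket(vec, hyperplanes), []).append(idx)

    output = [None] * len(features)
    for members in buckets.values():
        for i in members:
            # Dot-product similarity against bucket members only, not all features.
            scores = [sum(a * b for a, b in zip(features[i], features[j])) for j in members]
            peak = max(scores)  # subtract the max for a numerically stable softmax
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)
            output[i] = [
                sum(w * features[j][d] for w, j in zip(weights, members)) / total
                for d in range(dim)
            ]
    return output, buckets


# Toy features: two roughly opposite directions.
features = [[1.0, 0.1], [0.9, 0.2], [-1.0, 0.0], [-0.8, -0.1]]
attended, buckets = sparse_attention(features)
```

Each output vector is a weighted average of the features in its own bucket, so the cost scales with bucket size rather than with the total number of positions, which is the motivation for sparse over dense non-local attention.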