[1] GIRSHICK R, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014: 580-587.
[2] GIRSHICK R. Fast R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision, 2015: 1440-1448.
[3] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[4] HE K, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision, 2017: 2961-2969.
[5] LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot MultiBox detector[C]//Proceedings of the European Conference on Computer Vision, 2016: 21-37.
[6] BOCHKOVSKIY A, WANG C Y, LIAO H Y M. YOLOv4: optimal speed and accuracy of object detection[J]. arXiv:2004.10934, 2020.
[7] JOCHER G. YOLOv5 by Ultralytics (v6.1)[EB/OL]. (2022-02-22)[2023-01-02]. https://github.com/ultralytics/yolov5.
[8] REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 779-788.
[9] REDMON J, FARHADI A. YOLO9000: better, faster, stronger[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 7263-7271.
[10] REDMON J, FARHADI A. YOLOv3: an incremental improvement[J]. arXiv:1804.02767, 2018.
[11] DING X, ZHANG X, ZHOU Y, et al. Scaling up your kernels to 31×31: revisiting large kernel design in CNNs[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 11963-11975.
[12] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: transformers for image recognition at scale[J]. arXiv:2010.11929, 2020.
[13] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems, 2017: 5998-6008.
[14] BA J L, KIROS J R, HINTON G E. Layer normalization[J]. arXiv:1607.06450, 2016.
[15] HENDRYCKS D, GIMPEL K. Gaussian error linear units (GELUs)[J]. arXiv:1606.08415, 2016.
[16] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the European Conference on Computer Vision, 2014: 740-755.
[17] 徐光达, 毛国君. 多层级特征融合的无人机航拍图像目标检测[J]. 计算机科学与探索, 2023, 17(3): 635-645.
XU G D, MAO G J. Aerial image object detection of UAV based on multi-level feature fusion[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(3): 635-645.
[18] 田卓钰, 马苗, 杨楷芳. 基于级联注意力与点监督机制的考场目标检测模型[J]. 软件学报, 2022, 33(7): 2633-2645.
TIAN Z Y, MA M, YANG K F. Object detection model for examination classroom based on cascade attention and point supervision mechanism[J]. Journal of Software, 2022, 33(7): 2633-2645.
[19] 王剑哲, 吴秦. 坐标注意力特征金字塔的显著性目标检测算法[J]. 计算机科学与探索, 2023, 17(1): 154-165.
WANG J Z, WU Q. Salient object detection based on coordinate attention feature pyramid[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(1): 154-165.
[20] Papers with Code. COCO test-dev Benchmark[EB/OL]. (2022-06-22)[2023-01-02]. https://paperswithcode.com/sota/object-detection-on-coco.
[21] CUI Z, LI K, GU L, et al. You only need 90K parameters to adapt light: a light weight transformer for image enhancement and exposure correction[C]//Proceedings of the British Machine Vision Conference, 2022.
[22] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers[C]//Proceedings of the European Conference on Computer Vision, 2020: 213-229.
[23] JIANG Q, MAO Y, CONG R, et al. Unsupervised decomposition and correction network for low-light image enhancement[J]. IEEE Transactions on Intelligent Transportation Systems, 2022, 23(10): 19440-19455.
[24] LIU R, MA L, MA T, et al. Learning with nested scene modeling and cooperative architecture search for low-light vision[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(5): 1-17.
[25] LIU W, REN G, YU R, et al. Image-adaptive YOLO for object detection in adverse weather conditions[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2022: 1792-1800.
[26] HONG Y, WEI K, CHEN L, et al. Crafting object detection in very low light[C]//Proceedings of the British Machine Vision Conference, 2021.
[27] CUI Z, QI G J, GU L, et al. Multitask AET with orthogonal tangent regularity for dark object detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 2553-2562.
[28] ZHANG H, HAO K, PEDRYCZ W, et al. Vision transformer with convolutions architecture search[J]. arXiv:2203.10435, 2022.
[29] LIU Z, LIN Y, CAO Y, et al. Swin Transformer: hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 10012-10022.
[30] WANG C Y, LIAO H Y M, WU Y H, et al. CSPNet: a new backbone that can enhance learning capability of CNN[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020: 390-391.
[31] PENG Z, HUANG W, GU S, et al. Conformer: local features coupling global representations for visual recognition[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 367-376.
[32] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[33] CHEN Q, WU Q, WANG J, et al. MixFormer: mixing features across windows and dimensions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 5249-5259.
[34] IOFFE S, SZEGEDY C. Batch normalization: accelerating deep network training by reducing internal covariate shift[C]//Proceedings of the International Conference on Machine Learning, 2015: 448-456.
[35] ELFWING S, UCHIBE E, DOYA K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning[J]. Neural Networks, 2018, 107: 3-11.
[36] LOH Y P, CHAN C S. Getting to know low-light images with the exclusively dark dataset[J]. Computer Vision and Image Understanding, 2019, 178: 30-42.
[37] ZHANG H, CHANG H, MA B, et al. Dynamic R-CNN: towards high quality object detection via dynamic training[C]//Proceedings of the European Conference on Computer Vision, 2020: 260-275.
[38] CAI Z, VASCONCELOS N. Cascade R-CNN: delving into high quality object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 6154-6162.
[39] RADOSAVOVIC I, KOSARAJU R P, GIRSHICK R, et al. Designing network design spaces[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 10428-10436.
[40] SUN K, XIAO B, LIU D, et al. Deep high-resolution representation learning for human pose estimation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 5693-5703.
[41] TAN M, LE Q. EfficientNet: rethinking model scaling for convolutional neural networks[C]//Proceedings of the International Conference on Machine Learning, 2019: 6105-6114.
[42] LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[C]//Proceedings of the IEEE International Conference on Computer Vision, 2017: 2980-2988.
[43] CHEN K, WANG J, PANG J, et al. MMDetection: open MMLab detection toolbox and benchmark[J]. arXiv:1906.07155, 2019.
[44] LIN T Y, DOLLÁR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 2117-2125.
[45] LIU S, QI L, QIN H, et al. Path aggregation network for instance segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 8759-8768.