Review of YOLO Methods for Universal Object Detection

doi:10.3778/j.issn.1002-8331.2404-0130

Abstract

As the first single-stage object detection algorithm of the deep learning era, YOLO sparked a wave of enthusiasm in computer vision with its powerful and distinctive paradigm, and it has become a milestone in object detection. It remains a representative algorithm that achieves the best balance between speed and accuracy, and it is widely used in industrial applications such as autonomous driving and intelligent vision systems. Over the past eight years, driven by deep learning, YOLO methods have developed rapidly and have had a profound impact on the entire field of object detection. This paper surveys work on YOLO methods from the perspective of technological evolution and comprehensively summarizes the innovations and contributions of each iteration, from the initial YOLO v1 to the latest YOLO v9 and YOLO v10. Based on the major technical improvements made at different points in time, YOLO methods are divided into four groups: early basic YOLO, standard version YOLO, standard improvement YOLO, and unique improvement YOLO. The distinctive perspective of the improvements in each period is introduced in detail. In addition, the datasets and metrics used to evaluate YOLO methods are summarized, and detailed experimental results are collected both for different versions of YOLO and for different models within the same version. The development of YOLO is then summarized at the macro and micro levels. The analysis reveals the differences and inherent connections among YOLO versions in development framework, backbone network architecture, and prior box usage, and it emphasizes the importance of balancing speed and accuracy in YOLO. Finally, on the basis of this systematic review, the future development trends of YOLO methods are summarized.

Keywords: deep learning, computer vision, object detection, YOLO method

MI Zeng, LIAN Zhe. Review of YOLO Methods for Universal Object Detection[J]. Computer Engineering and Applications, 2024, 60(21): 38-54.

米增, 连哲. 面向通用目标检测的YOLO方法研究综述[J]. 计算机工程与应用, 2024, 60(21): 38-54.
