Review on Improvement and Application of 3D Convolutional Neural Networks

doi:10.3778/j.issn.1002-8331.2407-0031

Abstract

Abstract: Three-dimensional convolutional neural networks (3D CNNs), a class of deep neural networks, have shown excellent results in computer vision, particularly in video action recognition. However, 3D CNNs still have several problems. To address them, this paper summarizes and analyzes existing improved methods for video action recognition based on 3D convolution. The improvements are grouped under lightweight design, feature extraction, computational efficiency, and combined models. Practical applications of 3D CNNs are introduced, popular datasets are summarized, and the experimental results of the improved methods are compared and analyzed. Finally, future directions for video action recognition are discussed.

Key words: 3D convolutional neural networks (3DCNN), behavior recognition, deep learning

LI Zehui, ZHANG Lin, SHAN Xianying. Review on Improvement and Application of 3D Convolutional Neural Networks[J]. Computer Engineering and Applications, 2025, 61(3): 48-61.
