Computer Engineering and Applications, 2025, Vol. 61, Issue (3): 48-61. DOI: 10.3778/j.issn.1002-8331.2407-0031
李泽慧,张琳,山显英
Publication date: 2025-02-01
Release date: 2025-01-24
LI Zehui, ZHANG Lin, SHAN Xianying
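Abstract: As a class of deep neural networks, three-dimensional convolutional neural networks (3D CNNs) have shown excellent results in computer vision, particularly in video action recognition. However, 3D CNNs still have a number of problems. To address them, this paper summarizes and analyzes existing methods for improving 3D-convolution-based video action recognition. The improvements to 3D CNNs are grouped under lightweight design, feature extraction, computational efficiency, and combined models. The paper also introduces practical applications of 3D CNNs, summarizes popular datasets, and compares and analyzes the experimental results of the improved methods. Finally, it discusses future directions for video action recognition.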
李泽慧, 张琳, 山显英. 三维卷积神经网络方法改进及其应用综述[J]. 计算机工程与应用, 2025, 61(3): 48-61.
LI Zehui, ZHANG Lin, SHAN Xianying. Review on Improvement and Application of 3D Convolutional Neural Networks[J]. Computer Engineering and Applications, 2025, 61(3): 48-61.