Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (9): 1-18. DOI: 10.3778/j.issn.1002-8331.2310-0090

• Research Hotspots and Reviews •

Review on Human Action Recognition Methods Based on Multimodal Data

WANG Cailing, YAN Jingjing, ZHANG Zhidong   

  1. School of Computer Science, Xi’an Shiyou University, Xi’an 710065, China
  • Online: 2024-05-01 Published: 2024-04-29

Abstract: Human action recognition (HAR) is widely applied in intelligent security, autonomous driving, and human-computer interaction. With advances in capture equipment and sensor technology, the data available for HAR is no longer limited to RGB data but also includes other modalities such as depth, skeleton, and infrared data. This review introduces in detail the feature extraction methods used in HAR for the RGB and skeleton data modalities, covering both handcrafted-feature-based and deep learning-based methods. For the RGB modality, feature extraction algorithms based on the two-stream convolutional neural network (2s-CNN), the 3D convolutional neural network (3DCNN), and hybrid networks are analyzed in depth. For the skeleton modality, popular single-person and multi-person pose estimation algorithms are introduced first, and classification algorithms based on the convolutional neural network (CNN), recurrent neural network (RNN), and graph convolutional neural network (GCN) are then analyzed in depth. The common datasets for both modalities are also presented comprehensively. In addition, current challenges are examined in light of the respective data structure characteristics of RGB and skeleton data. Finally, future research directions for deep learning-based HAR methods are discussed.

Key words: video understanding, human action recognition, deep learning, feature extraction, pose estimation algorithms
