Driver Behavior Recognition Method Using Dual-Sequence Pose Integration

doi:10.3778/j.issn.1002-8331.2408-0410

Abstract

Identifying dangerous driving behavior can improve driving safety and is an important topic in autonomous driving research. Existing image-based driver behavior recognition methods suffer from high computational cost and redundant information. To address these issues, this paper proposes SimPoseConv3D, a driver behavior recognition method that fuses dual-sequence pose information. First, the SimCC human pose estimation module extracts a sequence of driver pose heatmaps from video. The heatmaps are then stacked, cropped, and sampled along the temporal dimension. The resulting heatmap volumes are fused in the forward and reverse temporal directions and fed into a 3D CNN, which extracts spatiotemporal action features for behavior recognition. Training, testing, and ablation experiments on the Drive&Act dataset show that the method reaches recognition accuracies of 70.25% on the Task-level (overall behavior) test set and 79.04% on the Mid-level (fine-grained behavior) test set, improvements of 6.07 and 4.13 percentage points, respectively, over the best publicly reported methods. In addition, using SimCC as the pose estimator improves computational efficiency by 18.51% compared with traditional pose estimators.

Key words: driver behavior recognition, human pose estimation, bi-directional pose heatmap sequences
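
The heatmap preparation described in the abstract (stack, crop, sample, then fuse forward and reverse sequences) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not state how the two sequences are fused, so channel-wise concatenation is an assumption here, as are uniform temporal sampling, the activation-based crop, the threshold value, and the function names `crop_to_activity`, `uniform_sample_indices`, and `build_dual_sequence_volume`.

```python
import numpy as np


def crop_to_activity(heatmaps: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Crop a (T, K, H, W) heatmap stack to the smallest box that encloses
    every activation above `threshold` (hypothetical cropping rule)."""
    mask = heatmaps.max(axis=(0, 1)) > threshold  # (H, W) activity map
    if not mask.any():
        return heatmaps  # nothing to crop against
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return heatmaps[:, :, rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


def uniform_sample_indices(num_frames: int, clip_len: int) -> np.ndarray:
    """Pick `clip_len` frame indices spread evenly over the sequence."""
    return np.linspace(0, num_frames - 1, clip_len).round().astype(int)


def build_dual_sequence_volume(heatmaps: np.ndarray, clip_len: int) -> np.ndarray:
    """Build a channels-first volume of shape (2K, clip_len, H, W).

    The first K channels hold the forward sequence and the last K channels
    hold the time-reversed sequence (assumed fusion: channel concatenation).
    """
    idx = uniform_sample_indices(heatmaps.shape[0], clip_len)
    forward = heatmaps[idx]                               # (clip_len, K, H, W)
    backward = forward[::-1]                              # reversed in time
    fused = np.concatenate([forward, backward], axis=1)   # (clip_len, 2K, H, W)
    return fused.transpose(1, 0, 2, 3)                    # (2K, clip_len, H, W)


if __name__ == "__main__":
    # Synthetic stand-in for pose-estimator output: 10 frames, 3 keypoints.
    demo = np.zeros((10, 3, 8, 8), dtype=np.float32)
    for t in range(10):
        demo[t, :, 2:5, 3:7] = t + 1
    volume = build_dual_sequence_volume(crop_to_activity(demo), clip_len=4)
    print(volume.shape)
```

The resulting array follows the channels-first (C, T, H, W) layout that 3D convolution layers commonly expect once a batch dimension is added. Other fusion choices, such as element-wise addition of the two sequences, would fit the same description.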


TAN Dayi, TIAN Wei, XIONG Lu. Driver Behavior Recognition Method Using Dual-Sequence Pose Integration[J]. Computer Engineering and Applications, 2025, 61(23): 126-134.

谭大艺, 田炜, 熊璐. 融合双序列姿态的驾驶员行为识别方法[J]. 计算机工程与应用, 2025, 61(23): 126-134.
