Combining Dynamic Split Convolutions and Attention for Multi-Scale Human Pose Estimation

doi:10.3778/j.issn.1002-8331.2307-0301

Abstract

Human pose estimation is increasingly important in fields such as animation design, security monitoring, and motion analysis. However, current human pose estimation algorithms prioritize accuracy, which leads to complex networks with high computational cost that are difficult to deploy on mobile devices and embedded platforms. To address this problem, this paper proposes DNSNet, a multi-scale human pose estimation network that combines dynamic split convolution with normalization-based attention. First, the bottleneck layer of the high-resolution network is redesigned as DKASCneck using dynamic split convolution and dynamic kernel aggregation, which avoids excessive use of large convolution kernels and reduces computational cost while strengthening the network's ability to extract useful features. Second, NAMPCblock, a basic module built from partial convolution and a normalization-based attention mechanism, is introduced; it reduces computational redundancy and memory access while preserving channel and spatial information to enhance cross-dimensional interaction. Finally, the network's output feature fusion is redesigned around multi-resolution features and deconvolution to improve the accuracy of heatmap regression. Experimental results show that, compared with the high-resolution network, the proposed model improves average precision (AP) on the COCO validation set by 2.1 percentage points while reducing computational complexity by 32.4% and the number of model parameters by 71.9%. On the MPII validation set, computational complexity is reduced by 38.9% and the number of model parameters by 71.9%. These results show that the proposed network substantially reduces network complexity while slightly improving detection accuracy.
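
To make the cost argument for partial convolution concrete, the following is a minimal sketch and not the authors' implementation. It assumes the commonly described form of partial convolution, in which a k x k convolution is applied to only a fraction of the channels and the remaining channels are passed through unchanged. The 1/4 channel ratio, the 64 x 48 feature map, and the 32-channel width are illustrative assumptions; the abstract does not state the configuration used in NAMPCblock.

```python
def conv_macs(height, width, kernel, c_in, c_out):
    """Multiply-accumulate count of a dense k x k convolution
    (stride 1, output the same size as the input, no bias)."""
    return height * width * kernel * kernel * c_in * c_out


def partial_conv_macs(height, width, kernel, channels, ratio=4):
    """MACs when only channels // ratio channels are convolved.

    The untouched channels are copied through, so they add no MACs.
    `ratio` is an assumed hyperparameter, not a value from the paper.
    """
    c_part = channels // ratio
    return conv_macs(height, width, kernel, c_part, c_part)


# Illustrative sizes (assumptions): a 64 x 48 feature map with 32 channels.
h, w, k, c = 64, 48, 3, 32
dense = conv_macs(h, w, k, c, c)
partial = partial_conv_macs(h, w, k, c)
reduction = 1 - partial / dense

print(f"dense 3x3 conv:   {dense} MACs")
print(f"partial 3x3 conv: {partial} MACs")
print(f"reduction:        {reduction:.2%}")  # 93.75% for ratio=4
```

Because the cost of a convolution scales with the product of input and output channels, convolving one quarter of the channels costs one sixteenth of the dense operation. The savings for a full network are smaller, since other layers still mix all channels.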

Key words: human pose estimation, high-resolution network, multi-scale

FENG Mingwen, XU Yang, ZHANG Yongdan, XIAO Ci, HUANG Yiqian. Combining Dynamic Split Convolutions and Attention for Multi-Scale Human Pose Estimation[J]. Computer Engineering and Applications, 2024, 60(22): 219-229.

冯明文, 徐杨, 张永丹, 肖慈, 黄易仟. 结合动态分裂卷积和注意力的多尺度人体姿态估计[J]. 计算机工程与应用, 2024, 60(22): 219-229.
