Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (22): 219-229. DOI: 10.3778/j.issn.1002-8331.2307-0301

• Graphics and Image Processing •

Combining Dynamic Split Convolutions and Attention for Multi-Scale Human Pose Estimation

FENG Mingwen, XU Yang, ZHANG Yongdan, XIAO Ci, HUANG Yiqian

  1. College of Big Data and Information Engineering, Guizhou University, Guiyang 550025, China
  2. Guiyang Aluminum-Magnesium Design and Research Institute Co., Ltd., Guiyang 550009, China
  • Online: 2024-11-15  Published: 2024-11-14

Abstract: Human pose estimation is increasingly important in fields such as animation design, security monitoring, and motion analysis. However, current human pose estimation algorithms prioritize accuracy, which leads to complex networks with high computational cost that are difficult to deploy on mobile devices and embedded platforms. To ease this problem, this paper proposes DNSNet, a multi-scale human pose estimation network that combines dynamic split convolution and normalization-based attention. First, the bottleneck layer of the high-resolution network is redesigned as DKASCneck, using dynamic split convolution and dynamic kernel aggregation. This avoids excessive use of large convolution kernels, lowering computational cost while strengthening the network's ability to extract useful features. Second, a basic module, NAMPCblock, is proposed that combines partial convolution with a normalization-based attention mechanism. This module reduces computational redundancy and memory access while preserving channel and spatial information to strengthen cross-dimension interaction. Finally, the fusion of the network's output features is redesigned on the basis of multi-resolution features and deconvolution, improving the accuracy of heatmap regression predictions. Experimental results show that, compared with the high-resolution network, the proposed model improves average accuracy on the COCO validation set by 2.1 percentage points while reducing computational complexity by 32.4% and the number of model parameters by 71.9%. On the MPII validation set, computational complexity is reduced by 38.9% and the number of model parameters by 71.9%. These results show that the proposed network substantially reduces network complexity while slightly improving detection accuracy.

Keywords: human pose estimation, high-resolution network, multi-scale
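To give a rough sense of why partial convolution can cut computation, the sketch below counts multiply-accumulate operations for a dense convolution and for a partial convolution that convolves only a fraction of the channels and passes the rest through unchanged. This is an illustration under stated assumptions, not the paper's NAMPCblock: the abstract does not specify the partial ratio, so the value of 1/4 is an assumption, and the function names and feature-map size are hypothetical.

```python
def conv_macs(height, width, kernel, c_in, c_out):
    """Multiply-accumulates of a dense kernel x kernel convolution
    (stride 1, output size equal to input size, bias ignored)."""
    return height * width * kernel * kernel * c_in * c_out


def partial_conv_macs(height, width, kernel, channels, ratio=0.25):
    """Multiply-accumulates when only the first `ratio` share of the
    channels is convolved; the remaining channels are left untouched.
    The default ratio of 1/4 is an assumption, not a value from the paper."""
    c_p = int(channels * ratio)
    return conv_macs(height, width, kernel, c_p, c_p)


if __name__ == "__main__":
    # Hypothetical feature map: 64 x 48 with 32 channels, 3 x 3 kernel.
    h, w, k, c = 64, 48, 3, 32
    dense = conv_macs(h, w, k, c, c)
    partial = partial_conv_macs(h, w, k, c)
    print(f"dense MACs:   {dense}")
    print(f"partial MACs: {partial}")
    print(f"ratio:        {partial / dense}")
```

With a ratio of 1/4, the convolved part costs 1/16 of the dense layer, because the cost scales with the product of input and output channel counts. The overall savings reported in the abstract depend on the full architecture and cannot be derived from this single-layer count.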