Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (13): 219-227. DOI: 10.3778/j.issn.1002-8331.2307-0385

• Graphics and Image Processing •

Lightweight Human Pose Estimation with Cascaded Channel Attention

LIN Yuanqiang, GAO Hui, WANG Peng, LYU Zhigang, LI Xiaoyan, WANG Chu

  1. School of Electronics and Information Engineering, Xi’an Technological University, Xi’an 710021, China
  2. Development Planning Department, Xi’an Technological University, Xi’an 710021, China
  • Online: 2024-07-01 Published: 2024-07-01

Abstract: To address the severe accuracy loss that current human pose estimation models suffer when made lightweight, a lightweight human pose estimation model with cascaded channel attention is proposed, using the high-resolution network (HRNet) as the baseline. First, a cascaded channel attention that preserves high-resolution features internally is constructed; it learns the importance of each channel of the input features to improve the model's representational capacity. Second, a lightweight depthwise convolutional transform module based on the MetaFormer structure is designed to replace the computationally expensive residual modules in stages 2, 3, and 4 of HRNet. Third, a multi-scale feature fusion method is designed to reduce the loss of semantic information in multi-dimensional features incurred by HRNet's original fusion method. Finally, unbiased data processing is adopted to eliminate the offset error introduced when keypoint heatmaps are encoded. Experimental results on the COCO2017 validation set show that, compared with the baseline model, the proposed model reduces the number of parameters and floating-point operations by 90.2% and 83.1%, respectively, at the cost of a 2-percentage-point drop in AP, and that its AP of 71.4% is the highest among lightweight models.
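
One reason a depthwise convolutional module can be much cheaper than a residual module is how its weight count scales with channel width. The sketch below is only an illustration of that general effect, not the authors' module: the channel width of 32 and the 3x3 kernel are assumed values, and the helper names are hypothetical. It compares a standard convolution with a depthwise convolution followed by a 1x1 pointwise convolution.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k depthwise conv plus a 1 x 1 pointwise conv (bias omitted)."""
    depthwise = c_in * k * k      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Assumed example sizes, not taken from the paper.
channels, kernel = 32, 3
std = standard_conv_params(channels, channels, kernel)
sep = depthwise_separable_params(channels, channels, kernel)
reduction = 1 - sep / std
print(f"standard: {std}, depthwise separable: {sep}, reduction: {reduction:.1%}")
```

The per-layer saving grows with the channel count, since the standard layer scales with the product of input and output channels and the kernel area, while the separable form pays the kernel area only once per channel. The whole-model figures in the abstract depend on the full architecture and cannot be derived from this per-layer count alone.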

Key words: human pose estimation, lightweight, channel attention, MetaFormer structure, multi-scale feature fusion
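
To make the offset error mentioned in the abstract concrete, the sketch below assumes that "unbiased data processing" measures image size in unit lengths (width minus one) rather than in pixel counts; the abstract does not spell this out, so treat it as an assumption. The input width of 192, the heatmap width of 48, and all function names are likewise illustrative. The code maps an x-coordinate to heatmap space directly and via a horizontal flip, then compares the two results under each scaling convention.

```python
def to_heatmap(x: float, scale: float) -> float:
    """Map an x-coordinate from input-image space to heatmap space."""
    return x * scale


def flip(x: float, width: int) -> float:
    """Horizontally flip an x-coordinate in an image of the given width."""
    return (width - 1) - x


def flip_roundtrip(x: float, in_w: int, hm_w: int, scale: float) -> float:
    """Flip in image space, map to the heatmap, then flip back in heatmap space."""
    return flip(to_heatmap(flip(x, in_w), scale), hm_w)


# Assumed sizes for illustration; not taken from the paper.
IN_W, HM_W = 192, 48
x = 100.0

biased_scale = HM_W / IN_W                 # ratio of pixel counts
unbiased_scale = (HM_W - 1) / (IN_W - 1)   # ratio of unit lengths

biased_error = flip_roundtrip(x, IN_W, HM_W, biased_scale) - to_heatmap(x, biased_scale)
unbiased_error = flip_roundtrip(x, IN_W, HM_W, unbiased_scale) - to_heatmap(x, unbiased_scale)

print(f"biased offset:   {biased_error:+.4f} heatmap pixels")
print(f"unbiased offset: {unbiased_error:+.4f} heatmap pixels")
```

With the pixel-count scale, the flipped path lands 0.75 heatmap pixels to the left of the direct path for these sizes, and the offset does not depend on x; with the unit-length scale, the two paths agree up to floating-point rounding.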