Computer Engineering and Applications, 2025, Vol. 61, Issue 15: 218-228. DOI: 10.3778/j.issn.1002-8331.2405-0028

• Pattern Recognition and Artificial Intelligence •

Efficient Lightweight Human Pose Estimation with Progressive Feature Fusion

XIAO Jun, ZHAO Ji

  1. School of Computer and Software Engineering, Liaoning University of Science and Technology, Anshan, Liaoning 114051, China
  • Published: 2025-08-01 Online: 2025-07-31

Abstract: Current human pose estimation models incur a heavy computational load in pursuit of detection performance. To address this, this paper proposes the efficient progressive feature fusion network (EPFFNet), which uses the high-resolution network as its baseline. First, the fourth stage of the high-resolution network is removed to balance model complexity against detection capability. Second, a hybrid efficient channel attention is constructed that attends to global and local channel feature information simultaneously, and on this basis a novel attention feature fusion module is proposed to achieve strong fusion between features. Third, an efficient coordinate attention is designed to strengthen the model's attention to feature information in different regions. Combined with diverse normalization and other lightweight modules, it forms the GENeck and GEBlock modules, which replace the residual blocks of the high-resolution network, preserving model performance while making the model lightweight. Finally, a progressive feature fusion module is designed to improve the model's ability to fuse features of different resolutions. Experimental results show that EPFFNet achieves accuracies of 75.1% and 90.8% on the COCO and MPII datasets, respectively. It attains the best accuracy among lightweight models and performs comparably to or better than larger models.

Key words: computer vision, human pose estimation, lightweight, high-resolution network, attention mechanism
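
The abstract does not specify how the hybrid efficient channel attention is built, so the following is only a minimal sketch of the generic efficient-channel-attention idea that the name suggests: global average pooling per channel, a 1D convolution across neighbouring channels, and a sigmoid gate. It is not the authors' module. The function names, the nested-list tensor layout, and the uniform kernel (which would be learned in practice) are all assumptions made for illustration.

```python
import math


def channel_attention_weights(feature_map, kernel=(1 / 3, 1 / 3, 1 / 3)):
    """Return one attention weight in (0, 1) per channel.

    feature_map: list of C channels, each a list of H rows of W floats.
    kernel: odd-length 1D weights shared across channels. A uniform kernel
            is an illustrative assumption; a real model would learn it.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")

    # Global average pooling: one scalar descriptor per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]

    # 1D convolution across neighbouring channels, zero-padded at both ends,
    # so each channel's weight depends on its local channel neighbourhood.
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    mixed = [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(pooled))
    ]

    # Sigmoid gate maps each mixed descriptor to a weight in (0, 1).
    return [1.0 / (1.0 + math.exp(-m)) for m in mixed]


def apply_channel_attention(feature_map, weights):
    """Scale every value of each channel by that channel's weight."""
    return [
        [[value * w for value in row] for row in channel]
        for channel, w in zip(feature_map, weights)
    ]


# Hypothetical 3-channel, 2x2 feature map used only to exercise the sketch.
example = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[3.0, 3.0], [3.0, 3.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
example_weights = channel_attention_weights(example)
example_out = apply_channel_attention(example, example_weights)
```

Because the convolution runs over the channel axis rather than through fully connected layers, the number of extra parameters equals the kernel length, which is why this family of attention is attractive for lightweight models.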