Computer Engineering and Applications, 2025, Vol. 61, Issue (6): 199-209. DOI: 10.3778/j.issn.1002-8331.2310-0407

• Pattern Recognition and Artificial Intelligence •

Human Pose Estimation with Multi-Scale and Multi-Level Feature Fusion

WANG Yanni, HU Min, HAN Shipeng, CHEN Yixuan, LYU Hao   

  1. School of Information and Control Engineering, Xi’an University of Architecture and Technology, Xi’an 710055, China
  2. Department of Military Biomedical Engineering, Air Force Medical University of PLA, Xi’an 710032, China
  • Online: 2025-03-15  Published: 2025-03-14

Abstract: Accuracy gains in human pose estimation usually depend on feature fusion, yet existing fusion strategies often ignore the interaction between scale features and level features. Fusing along only one of these dimensions can yield less expressive features. To exploit the complementarity between different features, a new multi-scale and multi-level feature fusion network (MSLNet) is proposed. The high-resolution network (HRNet) serves as the backbone: cross-scale information exchange between feature maps of different resolutions yields pose features that are both fine-grained and coarse-grained. An expectation maximization attention bidirectional feature pyramid network (EMA-BiFPN) is then introduced to aggregate multi-level features after multi-scale fusion, capturing the details and correlations of human pose from local to global. Finally, a keypoint detection head built from residual structures performs the final fusion of the output features and improves the accuracy of human keypoint detection. Experimental results show that MSLNet achieves the best accuracy, 75.8% on the COCO dataset and 91.1% on the MPII dataset, which confirms that MSLNet can exploit the complementarity between scale features and level features to improve the accuracy of human pose estimation.

Key words: high-resolution network (HRNet), human pose estimation, expectation maximization attention, bidirectional feature pyramid network, feature fusion
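
The multi-level aggregation mentioned in the abstract can be illustrated with a minimal NumPy sketch of BiFPN-style weighted fusion across a three-level pyramid. This is not the authors' EMA-BiFPN: the attention module is omitted, the fusion weights are fixed constants rather than learned parameters, nearest-neighbour resizing stands in for the usual pooling and upsampling layers, and the names `resize_nearest` and `weighted_fusion` as well as all tensor shapes are hypothetical.

```python
import numpy as np

def resize_nearest(x, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) feature map (assumed stand-in
    for the learned up/down-sampling layers of a real network)."""
    _, h, w = x.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[:, rows[:, None], cols[None, :]]

def weighted_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped maps with non-negative weights normalised to sum to ~1."""
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)  # clamp weights at 0
    w = w / (w.sum() + eps)
    return sum(wi * f for wi, f in zip(w, features))

# Hypothetical three-branch pyramid, e.g. outputs of a multi-resolution backbone.
rng = np.random.default_rng(0)
high = rng.standard_normal((8, 64, 48))  # finest resolution
mid = rng.standard_normal((8, 32, 24))
low = rng.standard_normal((8, 16, 12))   # coarsest resolution

# Top-down pass: coarse context flows into finer levels.
mid_td = weighted_fusion([mid, resize_nearest(low, 32, 24)], [1.0, 1.0])
high_out = weighted_fusion([high, resize_nearest(mid_td, 64, 48)], [1.0, 1.0])

# Bottom-up pass: fine detail flows back into coarser levels.
mid_out = weighted_fusion(
    [mid, mid_td, resize_nearest(high_out, 32, 24)], [1.0, 1.0, 1.0]
)
low_out = weighted_fusion([low, resize_nearest(mid_out, 16, 12)], [1.0, 1.0])

print(high_out.shape, mid_out.shape, low_out.shape)
```

Each output level keeps its own resolution while mixing information from its neighbours in both directions. In a trained network the fusion weights would be learned per node instead of being set to 1.0.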