Computer Engineering and Applications, 2025, Vol. 61, Issue (16): 160-170. DOI: 10.3778/j.issn.1002-8331.2405-0224

• Pattern Recognition and Artificial Intelligence •

Lightweight Human Pose Estimation with Joint Cross-Stage Information Fusion

CHEN Xianglong, LI Songyang, CHEN Enqing, GUO Xin, WANG Song

  1. School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China
  • Online: 2025-08-15 Published: 2025-08-15

Abstract: Practical applications of human pose estimation typically require a network model that is both accurate and fast, so model design must balance accuracy against real-time performance while remaining deployable on resource-constrained edge devices. Current lightweight human pose estimation models, however, lose significant accuracy as their computational complexity is reduced; they cannot balance accuracy and speed and are therefore difficult to deploy on edge devices. To address this problem, this paper adopts the lightweight backbone network EMO (efficient model) and designs a cross-stage attention mechanism that uses cross-stage feature distribution information to guide the alignment of features at different scales. A simple and effective multi-layer feature fusion method is then designed based on the weights of features from different stages. In addition, a feature fusion supervision loss function is introduced to directly optimize the multi-scale feature fusion process, and the feature decoder is made more lightweight, so that the model balances speed and accuracy. Test results on the COCO and MPII datasets show that, compared with the baseline model, the proposed model achieves higher accuracy at lower complexity and outperforms mainstream lightweight models.

Key words: human pose estimation, multi-scale feature fusion, lightweight network, feature fusion supervision
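
The abstract does not state the exact fusion operator, so the following is only a rough illustration of what weight-based multi-scale feature fusion can look like in general, not the paper's implementation. It assumes single-channel feature maps, nearest-neighbour upsampling to the finest resolution, and softmax-normalized per-stage weights; the names `upsample_nearest`, `softmax`, `fuse_stages`, and `stage_scores` are hypothetical.

```python
import math


def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a 2-D map (list of rows) by an integer factor."""
    return [[row[x // factor] for x in range(len(row) * factor)]
            for row in feat for _ in range(factor)]


def softmax(scores):
    """Turn arbitrary per-stage scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_stages(features, stage_scores):
    """Fuse maps from several stages into one map at the finest resolution.

    features: list of square 2-D maps, finest first; each coarser size must
        divide the finest size exactly (an assumption of this sketch).
    stage_scores: one hypothetical scalar score per stage.
    """
    weights = softmax(stage_scores)
    height, width = len(features[0]), len(features[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for weight, feat in zip(weights, features):
        up = upsample_nearest(feat, height // len(feat))
        for y in range(height):
            for x in range(width):
                fused[y][x] += weight * up[y][x]
    return fused, weights


# Toy input: three single-channel maps at 4x4, 2x2 and 1x1 resolution.
stages = [
    [[1.0] * 4 for _ in range(4)],
    [[2.0] * 2 for _ in range(2)],
    [[4.0]],
]
fused, weights = fuse_stages(stages, stage_scores=[0.0, 0.0, 0.0])
```

With equal scores every stage contributes equally, so each fused value is the plain average of the three stage values. In a trained network the scores would presumably be learned or predicted by an attention module, and a supervision loss could be applied to the fused map; neither is modelled here.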