Lightweight Human Pose Estimation with Joint Cross-Stage Information Fusion

doi:10.3778/j.issn.1002-8331.2405-0224

Abstract

In practical applications, human pose estimation often requires a network model that is both accurate and efficient, so model design must account for accuracy and real-time performance together to allow deployment on resource-constrained edge devices. However, the accuracy of current lightweight human pose estimation models drops significantly as computational complexity decreases, which makes it hard to balance accuracy and speed and therefore hard to deploy them on edge devices. To address this issue, this paper adopts the lightweight backbone network EMO (efficient model) and designs a cross-stage attention mechanism that uses cross-stage feature distribution information to guide the alignment of features at different scales. Based on the weights of features at different stages, a simple and effective multi-stage feature fusion method is also designed. In addition, a feature fusion supervision loss function is introduced to directly optimize the multi-scale feature fusion process, and the feature decoder is made more lightweight, so that the model balances speed and accuracy. Test results on the COCO and MPII datasets show that, compared with the baseline, the proposed model achieves higher accuracy at reduced complexity and outperforms mainstream lightweight models.

Keywords: human pose estimation, multi-scale feature fusion, lightweight network, feature fusion supervision

CHEN Xianglong, LI Songyang, CHEN Enqing, GUO Xin, WANG Song. Lightweight Human Pose Estimation with Joint Cross-Stage Information Fusion[J]. Computer Engineering and Applications, 2025, 61(16): 160-170.
