Human Pose Estimation with Multi-Scale and Multi-Level Feature Fusion

doi:10.3778/j.issn.1002-8331.2310-0407

Abstract

Improvements in human pose estimation accuracy usually depend on feature fusion. However, existing fusion strategies often ignore the interaction between scale features and level features, and fusing features in only one of these modes can yield weak feature representations. To make full use of the complementarity between the two kinds of features, a new multi-scale and multi-level feature fusion network (MSLNet) is proposed. The high-resolution network (HRNet) serves as the backbone: cross-scale information exchange between feature maps of different resolutions yields pose features that are both fine-grained and coarse-grained. The expectation maximization attention bidirectional feature pyramid network (EMA-BiFPN) is then introduced to aggregate multi-level features after the multi-scale fusion, capturing the details and correlations of human pose from local to global. Finally, a keypoint detection head built from residual structures completes the final fusion of the output features and improves the accuracy of human keypoint detection. Experimental results show that MSLNet achieves the best accuracy, 75.8% on the COCO dataset and 91.1% on the MPII dataset. These results verify that MSLNet can exploit the complementarity between scale features and level features to improve the accuracy of human pose estimation.
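The abstract does not spell out how features of different resolutions are combined inside EMA-BiFPN. As a rough illustration only, the sketch below shows the fast normalized weighted fusion used by the original BiFPN in EfficientDet, applied to a high-resolution map and an upsampled low-resolution map. It is not the MSLNet implementation: the expectation maximization attention step is omitted, the maps are single-channel nested lists, the weights are fixed numbers rather than learned parameters, and the function names `upsample_nearest` and `fast_normalized_fusion` are hypothetical.

```python
from typing import List, Sequence

Grid = List[List[float]]  # one single-channel feature map as nested lists


def upsample_nearest(feature: Grid, factor: int) -> Grid:
    """Nearest-neighbour upsampling: repeat every row and column `factor` times."""
    out: Grid = []
    for row in feature:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def fast_normalized_fusion(
    features: Sequence[Grid], weights: Sequence[float], eps: float = 1e-4
) -> Grid:
    """Fuse same-sized maps as O = sum_i (w_i / (eps + sum_j w_j)) * I_i."""
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature map")
    height, width = len(features[0]), len(features[0][0])
    for feature in features:
        if len(feature) != height or any(len(row) != width for row in feature):
            raise ValueError("all feature maps must share the same size")
    # A ReLU keeps each weight non-negative, so the normalized weights lie in [0, 1].
    clipped = [max(w, 0.0) for w in weights]
    total = eps + sum(clipped)
    return [
        [
            sum(c * feature[r][col] for c, feature in zip(clipped, features)) / total
            for col in range(width)
        ]
        for r in range(height)
    ]


# Assumed toy inputs: a 4x4 high-resolution map and a 2x2 low-resolution map.
high_res = [[1.0] * 4 for _ in range(4)]
low_res = [[2.0, 4.0], [6.0, 8.0]]
fused = fast_normalized_fusion([high_res, upsample_nearest(low_res, 2)], [1.0, 3.0])
```

In a trained network the weights would be learned per fusion node, and the fused map would normally pass through a convolution before the next pyramid level. Both steps are left out here to keep the fusion rule itself visible.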

Key words: high-resolution network (HRNet), human pose estimation, expectation maximization attention, bidirectional feature pyramid network, feature fusion

WANG Yanni, HU Min, HAN Shipeng, CHEN Yixuan, LYU Hao. Human Pose Estimation with Multi-Scale and Multi-Level Feature Fusion[J]. Computer Engineering and Applications, 2025, 61(6): 199-209.

王燕妮, 胡敏, 韩世鹏, 陈艺瑄, 吕昊. 多尺度和多层级特征融合的人体姿态估计[J]. 计算机工程与应用, 2025, 61(6): 199-209.
