Human Pose Estimation with Semantic Enhancement and Adaptive Multi-Scale Feature Fusion

doi:10.3778/j.issn.1002-8331.2407-0177

Abstract

Because keypoints are small in scale and sensitive to position, effectively extracting spatial and semantic information has long been the main challenge in pose estimation. To address this, the paper proposes SAMFFNet, a human pose estimation model with semantic enhancement and adaptive multi-scale feature fusion. SAMFFNet uses the lightweight MobileNetV2 as its backbone network to build a feature pyramid and uses EfficientViT to generate scale-aware global semantics. In the proposed deep semantic injection module, content-guided attention (CGA) fuses the global semantics with local features to strengthen the semantic representation of keypoints. The paper also proposes an adaptive multi-scale feature fusion module, which integrates the large selective kernel (LSK) module with a cross-layer interaction mechanism so that it can dynamically adjust a large spatial receptive field according to the input features and enhance information exchange between features at different scales. Experiments on the COCO validation set show that SAMFFNet improves accuracy by 6.1 percentage points over its backbone network, reaching 70.7%. Although its accuracy is slightly lower than that of the larger SimpleBaseline model, it has 85.0% fewer parameters and 78.3% less computation. On the MPII dataset, it likewise improves accuracy by 2.3 percentage points over the backbone network. Taken together, the results on COCO and MPII support the effectiveness of SAMFFNet in strengthening human structural features and in its feature fusion strategy.

Key words: human pose estimation, semantic augmentation, content-guided attention (CGA), adaptive feature fusion, feature pyramid network (FPN)
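The abstract mixes absolute gains (percentage points) with relative reductions (percent), so it may help to see what the COCO figures imply. The following is a minimal sketch that uses only numbers quoted in the abstract. The variable and function names are hypothetical, and the derived values are simple arithmetic, not results reported by the authors.

```python
# Figures quoted in the abstract for the COCO validation set.
samffnet_accuracy = 70.7    # reported accuracy, in percent
gain_points = 6.1           # gain over the backbone, in percentage points
param_reduction = 0.850     # fraction of parameters saved vs. SimpleBaseline
compute_reduction = 0.783   # fraction of computation saved vs. SimpleBaseline


def implied_backbone_accuracy(accuracy, gain):
    """Backbone accuracy implied by an absolute gain in percentage points."""
    return round(accuracy - gain, 1)


def size_ratio(reduction):
    """How many times larger the reference model is, given a relative reduction."""
    return round(1.0 / (1.0 - reduction), 1)


backbone_accuracy = implied_backbone_accuracy(samffnet_accuracy, gain_points)
relative_gain = round(100.0 * gain_points / backbone_accuracy, 1)
param_ratio = size_ratio(param_reduction)
compute_ratio = size_ratio(compute_reduction)

print(f"implied backbone accuracy: {backbone_accuracy}%")
print(f"relative accuracy gain over backbone: {relative_gain}%")
print(f"SimpleBaseline parameters vs. SAMFFNet: about {param_ratio}x")
print(f"SimpleBaseline computation vs. SAMFFNet: about {compute_ratio}x")
```

Under these figures, a 6.1-point absolute gain corresponds to a relative gain of roughly 9.4% over the backbone, while the 85.0% and 78.3% reductions mean SAMFFNet keeps about 15% of the parameters and 22% of the computation of SimpleBaseline.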

ZHANG Jiabo (张家波), HE Ajuan (何阿娟), TANG Shangsong (唐上松). Human Pose Estimation with Semantic Enhancement and Adaptive Multi-Scale Feature Fusion[J]. Computer Engineering and Applications (计算机工程与应用), 2025, 61(23): 212-223.
