Computer Engineering and Applications, 2025, Vol. 61, Issue (23): 212-223. DOI: 10.3778/j.issn.1002-8331.2407-0177

• Graphics and Image Processing •

Human Pose Estimation with Semantic Enhancement and Adaptive Multi-Scale Feature Fusion

ZHANG Jiabo, HE Ajuan, TANG Shangsong

  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Online: 2025-12-01 Published: 2025-12-01

Abstract: Because keypoints are small in scale and sensitive to position, effectively extracting spatial and semantic information remains a central challenge in pose estimation. To address this, this paper proposes SAMFFNet, a human pose estimation network with semantic enhancement and adaptive multi-scale feature fusion. SAMFFNet uses the lightweight MobileNetV2 as its backbone to build a feature pyramid and uses EfficientViT to generate scale-aware global semantics. In the proposed deep semantic injection module, content-guided attention (CGA) fuses the global semantics with local features to strengthen the semantic representation of keypoints. An adaptive multi-scale feature fusion module is also proposed. By integrating the large selective kernel (LSK) module with a cross-layer interaction mechanism, it dynamically adjusts a large spatial receptive field according to the input features and strengthens information exchange between features at different scales. Experiments on the COCO validation set show that SAMFFNet improves accuracy by 6.1 percentage points over its backbone network, reaching 70.7%. Although its accuracy is slightly lower than that of the larger SimpleBaseline model, it has 85.0% fewer parameters and 78.3% lower computational cost. On the MPII dataset, SAMFFNet likewise improves accuracy by 2.3 percentage points over the backbone network. Taken together, the results on COCO and MPII support the effectiveness of SAMFFNet in strengthening human structural features and of its feature fusion strategy.
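The COCO figures quoted in the abstract mix two kinds of quantity: an absolute gain in percentage points and relative reductions in percent. The short sketch below only rearranges those quoted numbers. The variable names are illustrative, and the derived values (an implied backbone accuracy of about 64.6% and the shares of SimpleBaseline's parameters and computation that remain) are arithmetic consequences of the abstract, not additional results reported by the authors.

```python
# Figures quoted in the abstract (COCO validation set).
samffnet_acc = 70.7        # accuracy reached by SAMFFNet, in %
gain_pp = 6.1              # gain over the backbone, in percentage points
param_reduction = 85.0     # % fewer parameters than SimpleBaseline
compute_reduction = 78.3   # % less computation than SimpleBaseline

# A percentage-point gain is an absolute difference, so the backbone's
# accuracy is recovered by plain subtraction.
backbone_acc = round(samffnet_acc - gain_pp, 1)

# The same gain expressed relative to the backbone's own accuracy.
relative_gain = round(100 * gain_pp / backbone_acc, 1)

# Shares of SimpleBaseline's parameter and compute budgets that remain.
params_kept = round(100 - param_reduction, 1)
compute_kept = round(100 - compute_reduction, 1)

print(backbone_acc, relative_gain, params_kept, compute_kept)
```

Read this way, the trade-off stated in the abstract is that SAMFFNet keeps roughly 15% of SimpleBaseline's parameters and about 22% of its computation at a slightly lower accuracy.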

Keywords: human pose estimation, semantic enhancement, content-guided attention (CGA), adaptive feature fusion, feature pyramid network (FPN)
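The abstract describes a fusion module that adjusts its spatial receptive field according to the input. As a rough intuition only, the toy sketch below blends two one-dimensional averaging branches with different window sizes, using per-position softmax weights computed from the input. It is not the authors' module and not the LSK implementation: the function names `box_filter` and `selective_fusion`, the hand-written gate, and the `gate_scale` parameter are all assumptions made for illustration, whereas a real module would learn its kernels and gates.

```python
import math


def box_filter(signal, k):
    """Average over a centered window of odd width k, with zero padding."""
    half = k // 2
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[j] if 0 <= j < n else 0.0
                  for j in range(i - half, i + half + 1)]
        out.append(sum(window) / k)
    return out


def selective_fusion(signal, kernel_sizes=(3, 7), gate_scale=4.0):
    """Blend branches with different receptive fields, position by position.

    Hypothetical gate (an assumption, not a learned one): a branch scores
    higher where its output stays closer to the input, and a softmax turns
    the scores into weights that sum to 1 at every position.
    """
    branches = [box_filter(signal, k) for k in kernel_sizes]
    fused, weights = [], []
    for i in range(len(signal)):
        scores = [-gate_scale * abs(b[i] - signal[i]) for b in branches]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        fused.append(sum(wi * b[i] for wi, b in zip(w, branches)))
    return fused, weights


# A single spike: the narrow branch should be favored at the spike itself.
fused, weights = selective_fusion([0, 0, 0, 0, 1, 0, 0, 0, 0])
print(weights[4])
```

On a flat input both branches agree, so the weights are equal. Near a sharp local feature the narrower window tracks the input more closely and receives the larger weight, which is the sense in which the effective receptive field depends on the input.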