Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (8): 126-132. DOI: 10.3778/j.issn.1002-8331.2010-0317

• Pattern Recognition and Artificial Intelligence •

Research on Multi-resolution Human Pose Estimation with Attention Mechanism

ZHANG Yue, HUANG Yourui, LIU Pengkun

  1. College of Electrical and Information Engineering, Anhui University of Science and Technology, Huainan, Anhui 232000, China
  • Online: 2021-04-15 Published: 2021-04-23

Research on Multi-resolution Human Pose Estimation with Attention Mechanism

ZHANG Yue, HUANG Yourui, LIU Pengkun   

  1. College of Electrical and Information Engineering, Anhui University of Science and Technology, Huainan, Anhui 232000, China
  • Online:2021-04-15 Published:2021-04-23

Abstract:

To address the problem that, in the human pose estimation task, directly fusing multi-resolution feature representations fails to make effective use of the spatial information in the feature maps, a multi-resolution human pose estimation network, GCT-Nonlocal Net (GNNet), is built on the structure of High-Resolution Net (HRNet) and combines channel-domain and spatial-domain attention mechanisms. An attention-based multi-resolution representation fusion method is proposed: before representations of different resolutions are fused, spatial attention extracts the more useful spatial information from each resolution branch to improve the fusion units, so that information is fused more effectively across resolutions and the final high-resolution representation carries richer feature information. In addition, the Gateneck and Gateblock modules are constructed; they introduce channel attention to model channel relationships more explicitly and thus extract channel information efficiently. Validation on MS COCO VAL 2017 shows that, with comparable numbers of parameters and amount of computation, the proposed GNNet achieves higher accuracy than HRNet, a network with state-of-the-art performance, improving mAP by 1.4 percentage points. The experimental results demonstrate that the proposed method effectively improves the fusion of multi-resolution feature representations.
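The "Nonlocal" in the network's name suggests that the spatial attention applied to each resolution branch before fusion is a non-local (self-attention) block. The following PyTorch sketch illustrates that general idea only; the class name, layer names, and reduction ratio are illustrative assumptions and not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionFusion(nn.Module):
    # Hypothetical non-local block applied to one branch before multi-resolution fusion.
    def __init__(self, channels, reduction=2):
        super().__init__()
        inter = max(channels // reduction, 1)
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)  # query projection
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)    # key projection
        self.g = nn.Conv2d(channels, inter, kernel_size=1)      # value projection
        self.out = nn.Conv2d(inter, channels, kernel_size=1)    # restore channel count

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.phi(x).flatten(2)                     # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        attn = F.softmax(q @ k, dim=-1)                # pairwise spatial affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual: spatially refined features

In such a design, each branch would be passed through a block like this before the usual resize-and-sum exchange step, so that the fused high-resolution representation aggregates spatially re-weighted information from every branch.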

Key words: convolutional neural network, human pose estimation, multi-resolution feature representation fusion, spatial attention mechanism, channel attention mechanism

Abstract:

To solve the problem that the spatial information of feature maps cannot be effectively utilized when multi-resolution feature representations are fused directly in the human pose estimation task, a multi-resolution human pose estimation network, GCT-Nonlocal Net (GNNet), is proposed based on the structure of High-Resolution Net (HRNet). GNNet combines channel-domain and spatial-domain attention mechanisms and contains improved exchange units together with the Gateneck and Gateblock modules. The exchange units are improved by adding a spatial attention mechanism before the multi-scale fusions to extract more useful spatial information from each resolution representation, which makes the information fusion between representations of different resolutions more effective and results in a final high-resolution representation that contains richer feature information. In addition, the Gateneck and Gateblock modules introduce a channel attention mechanism to model channel relationships more explicitly and thus extract channel information more effectively. Verification results on the MS COCO VAL 2017 dataset show that the proposed GNNet achieves higher accuracy than the state-of-the-art human pose estimation network HRNet with similar parameter and computation complexity, and mAP is improved by 1.4 percentage points. The experimental results demonstrate that the improved exchange units make multi-scale information fusion between the various resolution representations more effective.
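The "GCT" in GCT-Nonlocal Net suggests a Gated Channel Transformation style gate as the channel attention inside the Gateneck and Gateblock modules. The sketch below shows such a gate under that assumption; the class and parameter names are illustrative and do not reproduce the authors' code.

import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    # Hypothetical GCT-style channel attention gate, assumed to be wrapped around
    # standard residual bottleneck/basic blocks to form Gateneck/Gateblock.
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))   # embedding weight
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))  # gating weight
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))   # gating bias
        self.eps = eps

    def forward(self, x):
        # Per-channel L2 embedding of the spatial response
        embedding = (x.pow(2).sum(dim=(2, 3), keepdim=True) + self.eps).sqrt() * self.alpha
        # Normalize across channels to model competition between channels
        norm = self.gamma * embedding / (embedding.pow(2).mean(dim=1, keepdim=True) + self.eps).sqrt()
        gate = 1.0 + torch.tanh(norm + self.beta)
        return x * gate                                            # recalibrated channel responses

Because the gate only adds three vectors of per-channel parameters, inserting it into every residual block models channel relationships explicitly while keeping the parameter and computation overhead small, which is consistent with the abstract's claim of comparable complexity to HRNet.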

Key words: convolutional neural network, human pose estimation, multi-resolution feature representation fusion, spatial attention mechanism, channel attention mechanism