Human Pose Estimation Fusing Weight Adaptive Loss and Attention

doi:10.3778/j.issn.1002-8331.2301-0042

Abstract

Bottom-up human pose estimation suffers from an imbalance between foreground and background samples, and the high-resolution network cannot effectively capture channel information and spatial location information during feature extraction and feature fusion. To address these problems, this paper presents WA-HRNet (weight-adaptive fusing attention HRNet), a bottom-up human pose estimation network built on the high-resolution network HigherHRNet. First, a weight-adaptive loss function is proposed that adaptively adjusts the loss weights of different regions, so that HigherHRNet pays more attention to the central regions of human keypoints during training. Second, to obtain rich global information and further localize the keypoint regions, efficient global attention is proposed to strengthen the representation of the keypoint central regions. Finally, heatmap distribution modulation is introduced to improve the accuracy of decoding keypoint locations from the heatmaps. Experiments on the CrowdPose and COCO2017 datasets show that, compared with the baseline HigherHRNet, WA-HRNet improves AP by 5.8 percentage points on the CrowdPose test set and by 1.8 percentage points on the COCO2017 test-dev set, reaching 72.3% on the latter, and outperforms other mainstream bottom-up human pose estimation algorithms.

Key words: human pose estimation, bottom-up, attention, high-resolution network


JIANG Chunling, ZENG Bi, YAO Zhuangze, DENG Bin. Human Pose Estimation Fusing Weight Adaptive Loss and Attention[J]. Computer Engineering and Applications, 2023, 59(18): 145-153.

