Research on Lightweight and Efficient Bottom-Up Human Pose Estimation Algorithm

doi:10.3778/j.issn.1002-8331.2306-0392

Abstract

To address the model complexity and high computational cost of human pose estimation algorithms, this paper proposes a lightweight and efficient bottom-up human pose estimation network based on HigherHRNet, named LE-HigherHRNet (lightweight and efficient HigherHRNet). Depthwise separable convolutions reduce the number of parameters in the feature extraction network. A coordinate attention mechanism is introduced to better capture positional and channel feature information, highlighting the features of small targets and occluded human keypoints in the image. The network connects its multi-stage resolutions in parallel, which strengthens the extraction of shallow feature information. Skip connections and a lightweight CARAFE upsampling module retain and reconstruct feature information and enhance the spatial position information shared between high and low resolutions. Experimental results show that, compared with HigherHRNet, LE-HigherHRNet slightly improves accuracy while significantly reducing the number of model parameters and the computational complexity.

Key words: human pose estimation, lightweight network, coordinate attention, CARAFE upsampling
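The parameter saving that the abstract attributes to depthwise separable convolutions can be illustrated with a short worked calculation. This is a minimal sketch, not the paper's configuration: the channel counts and kernel size below are assumed purely for illustration, and bias terms are ignored. A standard k×k convolution has k·k·C_in·C_out weights, while a depthwise separable one has k·k·C_in (depthwise) plus C_in·C_out (1×1 pointwise).

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Illustrative sizes only (assumed, not taken from the paper).
c_in, c_out, k = 64, 128, 3

std = standard_conv_params(c_in, c_out, k)
dws = depthwise_separable_params(c_in, c_out, k)
ratio = dws / std  # equals 1/c_out + 1/k**2

print(std, dws, round(ratio, 3))  # 73728 8768 0.119
```

For these assumed sizes the separable layer needs roughly 12% of the weights of the standard layer; the actual reduction in any given network depends on its layer widths and on which layers are replaced.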


MA Sai, GE Haibo, HE Wenhao, CHENG Mengyang, AN Yu. Research on Lightweight and Efficient Bottom-Up Human Pose Estimation Algorithm[J]. Computer Engineering and Applications, 2024, 60(18): 217-229.

