Lightweight Human Pose Estimation Based on Attention and Dense Connection

doi:10.3778/j.issn.1002-8331.2201-0229

Abstract

Abstract: Most current human pose estimation methods focus on improving prediction accuracy, which leads to networks with large parameter counts and high computational complexity. To ease this trade-off, a lightweight human pose estimation network that integrates attention and dense connections is proposed on the basis of the high-resolution network. First, the bottleneck module of the high-resolution network is redesigned to reduce the computational complexity of that part of the network. Second, the introduced attention mechanism is improved and combined with dense connections to build a lightweight module, which replaces the basic module of the high-resolution network; the network thereby retains a reasonable level of accuracy while the parameter count and computational complexity are greatly reduced. Finally, the feature fusion of the network output is redesigned using multi-resolution features and deconvolution to improve prediction accuracy as far as possible. Experiments on the public MPII and COCO datasets show that, compared with the high-resolution network, the proposed model has 71.5% fewer parameters. On the MPII validation set, computational complexity is reduced by 35.8%. On the COCO validation set, computational complexity is reduced by 35.2% and average accuracy increases by 0.6 percentage points. The network thus effectively reduces complexity while maintaining detection accuracy.

Keywords: human pose estimation, high-resolution network, attention, dense connection, lightweight

DENG Hui, XU Yang. Lightweight Human Pose Estimation Based on Attention and Dense Connection[J]. Computer Engineering and Applications, 2022, 58(16): 265-273.
