[1] DUAN H D, ZHAO Y, CHEN K, et al. Revisiting skeleton-based action recognition[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 2969-2978.
[2] KHAN M A, JAVED K, KHAN S A, et al. Human action recognition using fusion of multiview and deep features: an application to video surveillance[J]. Multimedia Tools and Applications, 2024, 83(5): 14885-14911.
[3] FANG Z J, LÓPEZ A M. Intention recognition of pedestrians and cyclists by 2D pose estimation[J]. IEEE Transactions on Intelligent Transportation Systems, 2019, 21(11): 4773-4783.
[4] LU M Q, HU Y C, LU X B. Driver action recognition using deformable and dilated faster R-CNN with optimized region proposals[J]. Applied Intelligence, 2020, 50: 1100-1111.
[5] TOSHEV A, SZEGEDY C. DeepPose: human pose estimation via deep neural networks[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014: 1653-1660.
[6] XIAO B, WU H P, WEI Y C. Simple baselines for human pose estimation and tracking[C]//Proceedings of the 15th European Conference on Computer Vision, 2018: 466-481.
[7] SUN K, XIAO B, LIU D, et al. Deep high-resolution representation learning for human pose estimation[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 5693-5703.
[8] ZHANG W Q, FANG J M, WANG X G, et al. EfficientPose: efficient human pose estimation with neural architecture search[J]. Computational Visual Media, 2021, 7: 335-347.
[9] ZHANG Z, TANG J, WU G S. Simple and lightweight human pose estimation[J]. arXiv:1911.10346, 2019.
[10] YU C Q, XIAO B, GAO C X, et al. Lite-HRNet: a lightweight high-resolution network[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 10440-10450.
[11] LIU X Y, PENG H W, ZHENG N X, et al. EfficientViT: memory efficient vision transformer with cascaded group attention[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023: 14420-14430.
[12] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[13] SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, 2015: 1-9.
[14] WU C P, TAN G X, LI C Y. HEViTPose: high-efficiency vision transformer for human pose estimation[J]. arXiv:2311.13615, 2023.
[15] MA N N, ZHANG X Y, ZHENG H T, et al. ShuffleNet v2: practical guidelines for efficient CNN architecture design[C]//Proceedings of the 15th European Conference on Computer Vision, 2018: 116-131.
[16] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems 30, 2017.
[17] WANG W H, XIE E Z, LI X, et al. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021: 568-578.
[18] ANDRILUKA M, PISHCHULIN L, GEHLER P, et al. 2D human pose estimation: new benchmark and state of the art analysis[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014: 3686-3693.
[19] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the 13th European Conference on Computer Vision, 2014: 740-755.
[20] GENG Z G, SUN K, XIAO B, et al. Bottom-up human pose estimation via disentangled keypoint regression[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 14676-14686.
[21] ARANI E, GOWDA S, MUKHERJEE R, et al. A comprehensive study of real-time object detection networks across multiple domains: a survey[J]. arXiv:2208.10895, 2022.
[22] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[23] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[C]//Advances in Neural Information Processing Systems 25, 2012.
[24] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv:1409.1556, 2014.
[25] CHEN Y L, WANG Z C, PENG Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7103-7112.
[26] IANDOLA F N, HAN S, MOSKEWICZ M W, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size[J]. arXiv:1602.07360, 2016.
[27] HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7132-7141.
[28] HOWARD A G, ZHU M L, CHEN B, et al. MobileNets: efficient convolutional neural networks for mobile vision applications[J]. arXiv:1704.04861, 2017.
[29] XIE S N, GIRSHICK R, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017: 1492-1500.
[30] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[J]. arXiv:2010.11929, 2020.
[31] LIU Z, LIN Y T, CAO Y, et al. Swin Transformer: hierarchical vision transformer using shifted windows[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021: 10012-10022.
[32] TOUVRON H, CORD M, DOUZE M, et al. Training data-efficient image transformers & distillation through attention[C]//Proceedings of the 38th International Conference on Machine Learning, 2021: 10347-10357.
[33] CHOLLET F. Xception: deep learning with depthwise separable convolutions[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017: 1251-1258.
[34] LIN T Y, DOLLÁR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017: 2117-2125.
[35] CHEN L C, PAPANDREOU G, KOKKINOS I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(4): 834-848.
[36] BA J L, KIROS J R, HINTON G E. Layer normalization[J]. arXiv:1607.06450, 2016.
[37] NEWELL A, YANG K Y, DENG J. Stacked hourglass networks for human pose estimation[C]//Proceedings of the 14th European Conference on Computer Vision, 2016: 483-499.
[38] ZHANG H, WU C R, ZHANG Z G, et al. ResNeSt: split-attention networks[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 2736-2746.
[39] SANDLER M, HOWARD A, ZHU M L, et al. MobileNetV2: inverted residuals and linear bottlenecks[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018: 4510-4520.
[40] KINGMA D P, BA J. Adam: a method for stochastic optimization[J]. arXiv:1412.6980, 2014.
[41] XU Y F, ZHANG J, ZHANG Q M, et al. ViTPose: simple vision transformer baselines for human pose estimation[C]//Advances in Neural Information Processing Systems 35, 2022: 38571-38584.
[42] 高坤, 李汪根, 束阳, 等. 融入密集连接的多尺度轻量级人体姿态估计[J]. 计算机工程与应用, 2022, 58(24): 196-204.
GAO K, LI W G, SHU Y, et al. Multi-scale lightweight human pose estimation with dense connections[J]. Computer Engineering and Applications, 2022, 58(24): 196-204.
[43] SUN X, XIAO B, WEI F Y, et al. Integral human pose regression[C]//Proceedings of the 15th European Conference on Computer Vision, 2018: 529-545.