Human Pose Estimation Based on Dual-Stream Fusion of CNN and Transformer

doi:10.3778/j.issn.1002-8331.2406-0076

Abstract

Convolutional neural networks (CNNs) and Transformers are widely used in human pose estimation. However, Transformers mainly capture global image features and tend to overlook the local features that fine-grained pose estimation depends on, while CNNs lack the global modeling capability of Transformers. To combine the strength of CNNs in processing local information with that of Transformers in capturing global information, this paper proposes a CNN-Transformer dual-stream parallel network architecture that aggregates rich feature information. A conventional Transformer requires the image to be flattened into a sequence of patches, which hinders the extraction of position-sensitive human structural information. The multi-head attention structure is therefore modified so that the model input retains the structure of the original 2D feature map. In addition, a feature coupling module is introduced to fuse features of different resolutions from the two branches, retaining as much of the local and global features as possible. Finally, an improved coordinate attention module is incorporated to further strengthen the network's feature extraction capability. Experimental results on the COCO and MPII datasets show that the proposed model achieves higher detection accuracy than current mainstream models, indicating that it can effectively capture and fuse both local and global features of the human pose.

Keywords: convolutional neural network (CNN), Transformer, local feature, global feature, 2D feature map, feature coupling
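
The abstract mentions a coordinate attention module that works on the 2D feature map rather than on a flattened patch sequence. The sketch below illustrates only the general idea behind coordinate attention (direction-aware pooling along height and width, followed by gating), in plain Python. It is not the improved module proposed in the paper: the learned convolutions are omitted, the input is a single channel, and the function name `coordinate_gate` is a hypothetical name chosen for illustration.

```python
import math


def sigmoid(x):
    """Logistic function used to turn pooled values into gates in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def coordinate_gate(fmap):
    """Reweight a single-channel H x W feature map with row and column gates.

    Assumption: real coordinate attention passes the pooled vectors through
    learned 1x1 convolutions; this sketch skips them and gates directly.
    """
    h, w = len(fmap), len(fmap[0])
    # Pool along the width: one descriptor per row (keeps vertical position).
    row_mean = [sum(row) / w for row in fmap]
    # Pool along the height: one descriptor per column (keeps horizontal position).
    col_mean = [sum(fmap[i][j] for i in range(h)) / h for j in range(w)]
    row_gate = [sigmoid(v) for v in row_mean]
    col_gate = [sigmoid(v) for v in col_mean]
    # Each position is scaled by the gate of its row and the gate of its column.
    return [[fmap[i][j] * row_gate[i] * col_gate[j] for j in range(w)]
            for i in range(h)]


# A toy 3 x 3 map with a single strong response in the centre.
fmap = [[0.0, 0.0, 0.0],
        [0.0, 4.0, 0.0],
        [0.0, 0.0, 0.0]]
out = coordinate_gate(fmap)
```

Because the pooling is done separately along each axis, the gates stay tied to row and column positions, which is the property that makes this family of attention modules suitable for position-sensitive tasks such as keypoint localization.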

LI Xin, ZHANG Dan, GUO Xin, WANG Song, CHEN Enqing. Human Pose Estimation Based on Dual-Stream Fusion of CNN and Transformer[J]. Computer Engineering and Applications, 2025, 61(5): 187-199.
