Pose-Guided Human Instance Segmentation Driven by Contour Prior

doi:10.3778/j.issn.1002-8331.2407-0334

Abstract

Abstract: Person instance segmentation is hampered by complex, variable backgrounds and by occlusion and overlap between people, and traditional single-task person instance segmentation networks integrate human body feature information poorly. To address these problems, this paper proposes an instance segmentation method that combines prior human contour extraction with a pose-guided strategy, and builds a multi-task learning network for it. The network consists of three parts: a prior processing module, a human pose estimation module, and a pose-guided person instance segmentation module. A human contour extraction network serves as the prior processing part; it extracts the approximate outline of each person and thereby reduces background confusion. Contour mapping is then applied to the image with the contour attached, so that human keypoint information is captured more fully, the structural cues available during segmentation are enriched, and occlusion and overlap are handled better. Finally, the prior semantic segmentation mask is fused with the person instance segmentation mask produced by the pose-guided instance segmentation to improve segmentation accuracy. Experimental results show that the method outperforms the baseline among bottom-up multi-person human pose estimation methods, and that on person instance segmentation its average precision is 3.4% higher than that of the baseline pose-guided instance segmentation network.

Key words: human contour, human pose estimation, human instance segmentation, complex background, multi-task network
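
The abstract states that the prior semantic segmentation mask is fused with the pose-guided instance masks, but it does not give the fusion rule. The sketch below illustrates one simple possibility, assumed here purely for illustration: each per-person mask is intersected with the prior foreground mask, so that pixels the prior marks as background are removed. The function name `fuse_masks`, the 0.5 threshold, and the array layout are hypothetical and are not taken from the paper.

```python
import numpy as np

def fuse_masks(prior_mask, instance_masks, threshold=0.5):
    """Gate per-person masks with a prior foreground mask.

    Hypothetical fusion rule (an assumption, not the paper's method):
    a pixel stays in an instance mask only if the prior also marks it
    as foreground.
    prior_mask:     (H, W) foreground scores in [0, 1]
    instance_masks: (N, H, W) per-person scores in [0, 1]
    """
    prior_fg = np.asarray(prior_mask, dtype=float) >= threshold
    inst = np.asarray(instance_masks, dtype=float) >= threshold
    # Broadcast the (H, W) prior over the N instance masks.
    return inst & prior_fg[None, :, :]

# Toy 2x3 image with two person instances.
prior = np.array([[0.9, 0.8, 0.1],
                  [0.7, 0.2, 0.0]])
instances = np.array([
    [[1, 1, 1],
     [0, 0, 0]],
    [[0, 0, 0],
     [1, 1, 0]],
])
fused = fuse_masks(prior, instances)
print(fused.astype(int))
```

In this toy case the top-right pixel of the first instance and the middle pixel of the second are dropped because the prior scores them below the threshold. A soft fusion, such as a weighted average of scores, would be an equally plausible reading of the abstract.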

MA Junlong, ZHOU Jun, ZHAO Jinye, LI Yangyang. Pose-Guided Human Instance Segmentation Driven by Contour Prior[J]. Computer Engineering and Applications, 2025, 61(21): 253-264.
