Integrating Spatial Structure and Texture Features for Enhanced Human Pose Transfer

doi:10.3778/j.issn.1002-8331.2412-0451

Abstract

Abstract: Pose-guided portrait synthesis is a challenging frontier in image generation. This paper proposes a new network architecture, the Step network, designed to overcome limitations identified in previous work. Unlike traditional methods, it focuses on the spatial structure of the pose, enabling the pose to be transferred gradually while minimizing the loss of spatial structure information at each step. Drawing inspiration from the triplet loss, a style discriminator is incorporated to improve the quality of texture generation. Compared with prior research, greater emphasis is placed on generating the facial region: a specialized loss function combining triplet loss and L1 loss is used during training to optimize facial features, yielding images that align more closely with human perception. Generated image quality is evaluated with PSNR, SSIM, FID, and LPIPS. Qualitative and quantitative comparisons with state-of-the-art models show improvements across these metrics, confirming the advantage of the Step network. Specifically, the model achieves a PSNR of 18.0376, an SSIM of 0.7686, an FID of 10.8102, and an LPIPS of 0.1665.

Key words: human pose transfer, image generation, generative adversarial network (GAN), deep neural network
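
The face-oriented loss described in the abstract combines a triplet term with an L1 term. The following is a minimal sketch of how such a combination could be computed, not the paper's implementation. The function names, the weights `w_triplet` and `w_l1`, the margin value, and the choice of anchor (ground-truth face embedding), positive (generated face embedding), and negative (embedding of a different face) are all assumptions made for illustration.

```python
from typing import Sequence


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute difference between two flattened face crops."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 0.2) -> float:
    """Hinge-style triplet loss: pull positive toward anchor, push negative away."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + margin, 0.0)


def face_loss(gen_face: Sequence[float], real_face: Sequence[float],
              emb_real: Sequence[float], emb_gen: Sequence[float],
              emb_other: Sequence[float],
              w_triplet: float = 1.0, w_l1: float = 1.0,
              margin: float = 0.2) -> float:
    """Weighted sum of triplet and L1 terms (weights are hypothetical)."""
    tri = triplet_loss(emb_real, emb_gen, emb_other, margin)
    pix = l1_loss(gen_face, real_face)
    return w_triplet * tri + w_l1 * pix


# Toy usage: flattened 2x2 "face crops" and 2-D embeddings.
loss = face_loss(
    gen_face=[0.0, 0.0, 0.0, 0.0],
    real_face=[1.0, 1.0, 1.0, 1.0],
    emb_real=[1.0, 0.0],
    emb_gen=[1.0, 0.0],
    emb_other=[0.0, 1.0],
)
print(loss)
```

In this toy call the generated embedding matches the anchor, so the triplet term vanishes and only the pixel-level L1 term contributes. In practice the embeddings would come from a face recognition network and the crops from a face detector; both are outside the scope of this sketch.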

MO Han, XU Yang, FENG Mingwen. Integrating Spatial Structure and Texture Features for Enhanced Human Pose Transfer[J]. Computer Engineering and Applications, 2025, 61(11): 259-271.
