Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (19): 192-200.DOI: 10.3778/j.issn.1002-8331.2205-0293

• Graphics and Image Processing •

Face Reenactment Based on Unsupervised Motion Transfer and Video Correction

CHEN Junbin, YANG Zhijing   

  1. School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China
  • Online: 2023-10-01 Published: 2023-10-01




Abstract: Face reenactment transfers the upper-body motion of a driving actor to a target actor and synthesizes a video of the target. Existing methods either transfer motion inadequately or synthesize low-quality video. This paper proposes a face reenactment method that combines unsupervised motion transfer with deep learning-based correction. First, an unsupervised motion model transfers most of the driving actor's motion to the target, yielding a rough synthetic target video. Then, a generative neural network with a spatial-temporal structure corrects the rough video into a realistic and smooth one. To synthesize smooth and detailed video, 3D convolution and an attention mechanism are introduced into the network to process spatio-temporal information and guide the correction. To avoid background artifacts, the background information is embedded into the network as fixed parameters. To improve the realism of the teeth, a mouth enhancement loss is designed. The network is trained adversarially so that the generated images remain realistic. Experiments show that the method synthesizes high-quality target videos and outperforms current state-of-the-art face reenactment methods on performance metrics.
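
The abstract names a mouth enhancement loss but does not give its formula. As a rough illustration only, one plausible form is a full-frame L1 reconstruction term plus an extra weighted L1 term restricted to the mouth region. The function name `mouth_enhanced_l1`, the `mouth_box` argument, and the weight `mouth_weight=5.0` below are assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np


def mouth_enhanced_l1(pred, target, mouth_box, mouth_weight=5.0):
    """Full-frame L1 plus an extra weighted L1 term on the mouth crop.

    pred, target: float arrays of shape (H, W, C), values in [0, 1].
    mouth_box: (top, bottom, left, right) pixel bounds. Hypothetical input;
        in practice it would come from a facial landmark detector.
    mouth_weight: hypothetical weighting factor, not taken from the paper.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")

    top, bottom, left, right = mouth_box
    height, width = pred.shape[0], pred.shape[1]
    if not (0 <= top < bottom <= height and 0 <= left < right <= width):
        raise ValueError("mouth_box must be non-empty and inside the frame")

    abs_err = np.abs(pred - target)
    full_term = abs_err.mean()                              # whole frame
    mouth_term = abs_err[top:bottom, left:right].mean()     # mouth crop only
    return float(full_term + mouth_weight * mouth_term)


# Toy frames: the same error placed inside vs. outside the mouth box.
target = np.zeros((8, 8, 3))
box = (4, 6, 2, 6)

pred_mouth_error = np.zeros((8, 8, 3))
pred_mouth_error[4:6, 2:6] = 0.5      # error inside the mouth box

pred_other_error = np.zeros((8, 8, 3))
pred_other_error[0:2, 2:6] = 0.5      # same-sized error elsewhere

loss_in_mouth = mouth_enhanced_l1(pred_mouth_error, target, box)
loss_elsewhere = mouth_enhanced_l1(pred_other_error, target, box)
```

With this weighting, an error of the same size is penalized more heavily when it falls inside the mouth box, which is the general effect a region-focused loss is meant to have on details such as teeth.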

Key words: face reenactment, unsupervised learning, generative adversarial network, attention mechanism, 3D convolution
