Face Reenactment Based on Unsupervised Motion Transfer and Video Correction

doi:10.3778/j.issn.1002-8331.2205-0293

Abstract

Abstract: Face reenactment aims to transfer the upper-body motion of a driving actor to a target actor. Current methods either transfer motion inadequately or fail to synthesize high-quality video. This paper proposes a face reenactment method that combines unsupervised motion transfer with deep learning-based video correction. First, an unsupervised motion model transfers most of the driving actor's motion to the target actor, producing a rough synthetic target video. Then, a generative neural network with a spatio-temporal structure corrects the rough video into a realistic, smooth one. To synthesize smooth and detailed video, 3D convolution and an attention mechanism are introduced into the network to process temporal information and guide the correction. To avoid background artifacts, the background information is embedded into the network as fixed parameters. To improve the realism of the teeth, a mouth enhancement loss is designed. The network is trained adversarially to ensure the realism of the generated images. Experiments show that the method synthesizes high-quality target videos and outperforms current state-of-the-art face reenactment methods on the evaluated metrics.

Key words: face reenactment, unsupervised learning, generative adversarial network, attention mechanism, 3D convolution
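
The abstract names two design choices without giving their exact form: a fixed background and a mouth enhancement loss. The sketch below illustrates one plausible reading of each, under stated assumptions. It is not the authors' implementation. The function names, the mask-based compositing, the L1 form of the loss, and the `mouth_weight` value are all hypothetical, and frames are simplified to grayscale nested lists.

```python
def composite_with_fixed_background(foreground, fg_mask, background):
    """Blend generated foreground pixels over a fixed background frame.

    All arguments are H x W nested lists of floats. fg_mask is 1.0 where
    the actor is and 0.0 elsewhere (an assumed representation).
    """
    return [
        [m * f + (1.0 - m) * b for f, m, b in zip(f_row, m_row, b_row)]
        for f_row, m_row, b_row in zip(foreground, fg_mask, background)
    ]


def mouth_weighted_l1(pred, target, mouth_mask, mouth_weight=10.0):
    """Weighted mean absolute error with extra weight inside the mouth mask.

    mouth_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    total, norm = 0.0, 0.0
    for p_row, t_row, m_row in zip(pred, target, mouth_mask):
        for p, t, m in zip(p_row, t_row, m_row):
            w = 1.0 + (mouth_weight - 1.0) * m  # 1 outside the mouth, mouth_weight inside
            total += w * abs(p - t)
            norm += w
    return total / norm


# Toy 4x4 frame: the prediction is wrong only inside a 2x2 "mouth" region.
size = 4
target = [[0.0] * size for _ in range(size)]
pred = [[0.0] * size for _ in range(size)]
mouth = [[0.0] * size for _ in range(size)]
for r in (2, 3):
    for c in (1, 2):
        mouth[r][c] = 1.0
        pred[r][c] = 0.5

plain = mouth_weighted_l1(pred, target, mouth, mouth_weight=1.0)
boosted = mouth_weighted_l1(pred, target, mouth, mouth_weight=10.0)
print(plain, boosted)  # the mouth-weighted loss penalizes the same error more

# Pixels outside the mask are copied from the fixed background unchanged.
background = [[1.0] * size for _ in range(size)]
frame = composite_with_fixed_background(pred, mouth, background)
```

With a uniform weight the mouth error is diluted across the whole frame. Raising the weight inside the mask makes the same error contribute more to the loss, which is the general idea behind emphasizing a small region such as the teeth.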

CHEN Junbin, YANG Zhijing. Face Reenactment Based on Unsupervised Motion Transfer and Video Correction[J]. Computer Engineering and Applications, 2023, 59(19): 192-200.

陈俊彬, 杨志景. 无监督动作迁移再修复的人脸重演方法[J]. 计算机工程与应用, 2023, 59(19): 192-200.
