Generating Face from Voice: Method of Voice-Driven Static and Dynamic Face Generation

doi:10.3778/j.issn.1002-8331.2101-0318

Abstract

Abstract: Voice-driven face generation aims to exploit the static and dynamic correlations between voice clips and faces, so that a corresponding face image can be generated from a given voice clip. However, most existing methods consider only one of these correlations, and research on static face generation relies strictly on time-aligned audio-visual data, which limits the applicability of static models to some extent. This paper proposes a voice-driven static and dynamic face generation model (SDVF-GAN) based on conditional generative adversarial networks. SDVF-GAN builds a voice encoder network on the self-attention mechanism to obtain a more accurate auditory feature representation, which serves as the input to both the static and the dynamic generation networks. The static generation network uses an image discriminator based on a projection layer to synthesize high-quality static face images with consistent attributes (age, gender). The dynamic generation network uses an attention-based lip discriminator and an image discriminator to synthesize dynamic face sequences with synchronized lips. In the experiments, the static face generation network is trained on a newly constructed attribute-aligned Voice-Face dataset, and the dynamic face generation network is trained on the public LRW dataset. The results show that the model jointly addresses the attribute correspondence and lip synchronization relationships between voice and face, and generates face images of higher quality with stronger correlation and synchronization.

Key words: voice-driven, static and dynamic face generation, consistent attributes, lip synchronization, generative adversarial networks
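
The abstract states that the voice encoder uses a self-attention mechanism but gives no architectural details. As a rough illustration only, the sketch below shows generic scaled dot-product self-attention applied to a short sequence of per-frame audio features. It is not the SDVF-GAN encoder: the function names, the input layout (T frames of D features), the projection matrices `w_q`, `w_k`, `w_v`, and the toy sizes are all assumptions introduced here.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames, w_q, w_k, w_v):
    """Generic scaled dot-product self-attention over audio frames.

    frames: T x D per-frame features (hypothetical input layout).
    w_q, w_k, w_v: D x H projection matrices (hypothetical parameters).
    Returns (T x H attended features, T x T attention weights).
    """
    q = matmul(frames, w_q)
    k = matmul(frames, w_k)
    v = matmul(frames, w_v)
    scale = math.sqrt(len(w_q[0]))  # sqrt of the projection width H
    k_t = [list(col) for col in zip(*k)]  # transpose K to H x T
    scores = matmul(q, k_t)  # T x T similarity between frames
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Fill a rows x cols matrix with standard normal samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, D, H = 5, 8, 4  # toy sizes, chosen only for illustration
frames = random_matrix(T, D, rng)
w_q, w_k, w_v = (random_matrix(D, H, rng) for _ in range(3))
attended, weights = self_attention(frames, w_q, w_k, w_v)
print(len(attended), len(attended[0]))  # prints "5 4"
```

Each row of `weights` sums to 1, so every output frame is a weighted average of the projected values of all frames. This lets a frame draw on context from anywhere in the clip, which is the usual motivation for placing self-attention in a sequence encoder.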

ZHAO Lulu, CHEN Yanxiang, ZHAO Pengcheng, ZHU Yupeng, SHENG Zhentao. Generating Face from Voice: Method of Voice-Driven Static and Dynamic Face Generation[J]. Computer Engineering and Applications, 2022, 58(18): 122-129.
