Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (18): 122-129. DOI: 10.3778/j.issn.1002-8331.2101-0318

• Special Topic on Generative Adversarial Networks •


Generating Face from Voice: Method of Voice-Driven Static and Dynamic Face Generation

ZHAO Lulu, CHEN Yanxiang, ZHAO Pengcheng, ZHU Yupeng, SHENG Zhentao   

  1. School of Computer and Information, Hefei University of Technology, Hefei 230009, China
  • Online: 2022-09-15  Published: 2022-09-15


Abstract: Voice-driven face generation aims to explore the static and dynamic correlations between voice fragments and faces, so that corresponding face images can be generated from a given voice fragment. However, most existing work considers only one of these correlations, and methods for static face generation strictly rely on time-aligned audio-visual data, which limits the applicability of such static models to a certain extent. Therefore, a voice-driven static and dynamic face generation model (SDVF-GAN) based on conditional generative adversarial networks is proposed. SDVF-GAN builds a voice encoder network that obtains more accurate auditory features through a self-attention mechanism; both the static and the dynamic generation networks take these auditory features as input. The static generation network uses an image discriminator with a projection layer to synthesize high-quality static face images with consistent attributes (age, gender). The dynamic generation network uses the image discriminator together with an attention-based lip discriminator to generate dynamic face sequences with lip synchronization. In the experiments, an attribute-aligned Voice-Face dataset is constructed to train the static model, and the public LRW dataset is used to train the dynamic model. The results demonstrate that the model comprehensively captures the attribute correspondence and lip-synchronization relationships between voice and face, and generates face images with higher quality and stronger correlation and synchronization.
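The self-attention mechanism the abstract attributes to the voice encoder can be illustrated with a minimal NumPy sketch of single-head scaled dot-product self-attention over a sequence of acoustic frames. This is an illustrative sketch only; the function names, random projection matrices, and dimensions below are assumptions for demonstration, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(frames, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over audio frames.

    frames   : (T, d) sequence of per-frame acoustic features
    Wq/Wk/Wv : (d, d_k) projection matrices (learned in practice, random here)
    Returns a (T, d_k) sequence in which each output frame is a
    weighted mixture of all input frames, letting the encoder pool
    context from the whole utterance.
    """
    Q, K, V = frames @ Wq, frames @ Wk, frames @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T, T) frame-to-frame affinities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
T, d, dk = 8, 16, 16                          # illustrative sizes
frames = rng.standard_normal((T, d))          # stand-in for e.g. mel features
Wq, Wk, Wv = (rng.standard_normal((d, dk)) for _ in range(3))
out = self_attention(frames, Wq, Wk, Wv)
print(out.shape)                              # (8, 16)
```

In a full encoder, such an attention block would typically be followed by pooling over the time axis to produce a single auditory feature vector for the generators.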

Key words: voice-driven, static and dynamic face generation, consistent attributes, lip synchronization, generative adversarial networks