Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (17): 33-46. DOI: 10.3778/j.issn.1002-8331.2411-0205

• Hot Topics and Reviews •

Review of Audio Driven Face Image Generation

HU Yuanping, YAN Hongcan

  1. School of Science, North China University of Science and Technology, Tangshan, Hebei 063210, China
  2. Hebei Key Laboratory of Data Science and Application, Tangshan, Hebei 063210, China
  • Online: 2025-09-01 Published: 2025-09-01

Abstract: Audio-driven face image generation aims to produce a talking video from input audio and a static image (or video). The technology shows significant application potential in virtual character interaction, digital media creation, game development, and other fields, and it offers broad research prospects and substantial research value. Building on a classification and analysis of commonly used audio feature extraction, intermediate representation, and feature fusion methods, this paper surveys audio-driven face solutions based on generative adversarial networks (GANs), neural radiance fields (NeRF), and diffusion models. By analyzing the key techniques of each approach and comparing the generation results of selected approaches, it summarizes their strengths and weaknesses in terms of generated image quality, lip synchronization, and real-time performance. The paper also examines commonly used datasets and evaluation metrics, identifies the current challenges in audio-driven face image generation, and outlines possible directions for future research.

Key words: audio-driven, face image generation, multimodal, neural network, digital human