
Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (17): 33-46. DOI: 10.3778/j.issn.1002-8331.2411-0205
HU Yuanping, YAN Hongcan
Online: 2025-09-01
Published: 2025-09-01
Abstract: Audio-driven face image generation aims to synthesize a dynamic talking video from an input audio clip and a static image (or video). It shows significant application potential in fields such as virtual character interaction, digital media creation, and game development, and therefore offers broad research prospects and important research value. Building on a categorized analysis of commonly used audio feature extraction methods, intermediate representations, and feature fusion methods, this survey summarizes audio-driven face generation solutions based on generative adversarial networks, neural radiance fields, and diffusion models. By analyzing the key techniques of each approach and comparing the generation results of representative methods, it identifies their respective strengths and weaknesses in terms of generated image quality, lip synchronization, and real-time performance. It further examines commonly used datasets and evaluation metrics, points out the current challenges of audio-driven face image generation, and discusses possible directions for future research.
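The image-quality dimension of such comparisons is typically quantified with frame-level metrics such as SSIM and PSNR. As a minimal illustration only (not taken from the paper; scikit-image is assumed, and the file names are hypothetical), the following sketch computes both metrics between one generated frame and its ground-truth counterpart:

    # Minimal sketch: frame-level image-quality metrics (SSIM / PSNR) commonly
    # used when comparing talking-face generation methods.
    # Assumptions: RGB frames of identical size; file names are hypothetical.
    from skimage import io, img_as_float
    from skimage.color import rgb2gray
    from skimage.metrics import structural_similarity, peak_signal_noise_ratio

    # Load one generated frame and the corresponding ground-truth frame,
    # converting pixel values to floats in [0, 1].
    gen = img_as_float(io.imread("generated_frame.png"))
    ref = img_as_float(io.imread("ground_truth_frame.png"))

    # SSIM on grayscale copies keeps the call independent of channel layout.
    ssim_val = structural_similarity(rgb2gray(ref), rgb2gray(gen), data_range=1.0)

    # PSNR is computed directly on the normalized color frames.
    psnr_val = peak_signal_noise_ratio(ref, gen, data_range=1.0)

    print(f"SSIM: {ssim_val:.4f}  PSNR: {psnr_val:.2f} dB")

In practice these per-frame scores are averaged over a test clip; lip synchronization and real-time performance are measured with separate metrics (e.g., sync confidence scores and frames per second).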
HU Yuanping, YAN Hongcan. Review of Audio Driven Face Image Generation[J]. Computer Engineering and Applications, 2025, 61(17): 33-46.