Review of Speech Synthesis Methods Under Low-Resource Condition

doi:10.3778/j.issn.1002-8331.2211-0322

Abstract

Abstract: Speech synthesis is an active research area in human-computer interaction. With the advent of deep learning, research has shifted from inefficient traditional methods to end-to-end speech synthesis based on neural networks. However, building a mature speech synthesis system remains difficult in low-resource settings, where minority-language corpora, target-speaker training data, or large emotional speech datasets are hard to collect. This paper therefore introduces the classic speech synthesis models by category and systematically reviews domestic and international research on the low-resource problem. From the perspectives of system composition and model training, it describes the mainstream techniques used in recent years to improve the overall performance of speech synthesis models, and it summarizes open-source speech datasets covering multiple languages, emotions, and speakers that suit different speech synthesis tasks. It then surveys low-resource speech synthesis methods that apply deep learning and machine learning techniques such as transfer learning, meta-learning, and data augmentation, comparing their advantages and disadvantages, and briefly introduces speaker adaptation, voice cloning, and voice conversion in few-shot scenarios. Finally, it discusses feasible research directions for alleviating the low-resource speech synthesis problem.

Key words: speech synthesis, low resource, data augmentation, transfer learning, meta learning, fine-tuning

ZHANG Jialin, Mairidan Wushouer, Gulanbaier Tuerhong. Review of Speech Synthesis Methods Under Low-Resource Condition[J]. Computer Engineering and Applications, 2023, 59(15): 1-16.
