Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (15): 1-16. DOI: 10.3778/j.issn.1002-8331.2211-0322
• Research Hotspots and Reviews •
Review of Speech Synthesis Methods Under Low-Resource Condition
ZHANG Jialin, Mairidan Wushouer, Gulanbaier Tuerhong
Online: 2023-08-01
Published: 2023-08-01
ZHANG Jialin, Mairidan Wushouer, Gulanbaier Tuerhong. Review of Speech Synthesis Methods Under Low-Resource Condition[J]. Computer Engineering and Applications, 2023, 59(15): 1-16.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2211-0322