Personalized Speech Synthesis Based on Separated Contrastive Learning
SHANG Ying, HAN Chao, WU Kewei
1. School of Elementary Education, Fuyang Early Childhood Teacher’s College, Fuyang, Anhui 236015, China
2. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui 230601, China