Personalized Speech Synthesis Based on Separated Contrastive Learning

doi:10.3778/j.issn.1002-8331.2306-0127

Abstract

Abstract: Personalized speech synthesis generates speech in the style of a target speaker from that speaker's reference speech. Reference speech depends on both the speaker's style and the text content it carries. Existing methods treat the reference speech as a single whole in contrastive learning and do not contrast speaker style and linguistic content separately; as a result, the synthesized speech is disturbed by the linguistic content and drifts away from the target speaker's style. This paper designs a style-content separated contrastive loss for a personalized speech synthesis model, which consists of a style-content separated contrastive module, a speaker module, and a speech decoder module. The style-content separated contrastive module treats the style and content of the query reference speech as the positive example and constructs negative examples by separating style from content. These separated negatives push the query style away from the content of other reference speech, and push the query content away from the style of other reference speech. The module thereby learns speech features that account for both style and content. The speaker module learns speaker identity features, which guide the learning of speaker style. The speech decoder module fuses the style-content speech features with the speaker identity features to better capture speaker-style attributes such as duration, pitch, and energy. Experiments on the VCTK and LibriTTS datasets show that the proposed method improves the speaker similarity of the synthesized speech and achieves higher synthesized speech quality than existing methods.

Key words: speech synthesis, separated contrastive learning, speaker style
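
The abstract describes the separated contrastive loss only in words, so the following is a minimal sketch of one way such a loss could look, not the authors' formulation. It assumes an InfoNCE-style objective with cosine similarity and a temperature, and the names `separated_contrastive_loss`, `style`, `content`, and `temperature` are hypothetical. For each query utterance i, the positive pair is (style_i, content_i); the negatives pair the query style with the content of other utterances, and the query content with the style of other utterances.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def separated_contrastive_loss(style, content, temperature=0.1):
    """Illustrative style-content separated contrastive loss (assumed InfoNCE form).

    style, content: arrays of shape (N, D), one row per reference utterance.
    Returns the mean loss over the N queries.
    """
    s = l2_normalize(np.asarray(style, dtype=float))
    c = l2_normalize(np.asarray(content, dtype=float))

    # logits[i, j] = cos(style_i, content_j) / temperature
    logits = s @ c.T / temperature
    n = logits.shape[0]
    off_diag = ~np.eye(n, dtype=bool)

    # Positive: style and content taken from the same (query) utterance.
    pos = np.diag(logits)

    # Negatives, set 1: query style vs. content of the other utterances.
    neg_style_query = np.where(off_diag, logits, -np.inf)
    # Negatives, set 2: query content vs. style of the other utterances.
    neg_content_query = np.where(off_diag, logits.T, -np.inf)

    all_logits = np.concatenate(
        [pos[:, None], neg_style_query, neg_content_query], axis=1
    )

    # Numerically stable log-sum-exp over the positive and all negatives.
    row_max = all_logits.max(axis=1, keepdims=True)
    log_sum_exp = row_max[:, 0] + np.log(np.exp(all_logits - row_max).sum(axis=1))

    # -log( exp(pos) / sum(exp(all)) ), averaged over queries.
    return float(np.mean(log_sum_exp - pos))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    style = rng.normal(size=(8, 16))    # stand-in style embeddings
    content = rng.normal(size=(8, 16))  # stand-in content embeddings
    print(separated_contrastive_loss(style, content))
```

In this sketch the loss is small when each utterance's style and content embeddings agree with each other and disagree with those of other utterances, and grows when a query's style matches another utterance's content (or the reverse). How the paper actually samples negatives, measures similarity, and weights this loss against the synthesis losses is not stated in the abstract.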

SHANG Ying, HAN Chao, WU Kewei. Personalized Speech Synthesis Based on Separated Contrastive Learning[J]. Computer Engineering and Applications, 2023, 59(22): 158-165.
