Computer Engineering and Applications, 2023, Vol. 59, Issue (22): 158-165. DOI: 10.3778/j.issn.1002-8331.2306-0127

• Pattern Recognition and Artificial Intelligence •

Personalized Speech Synthesis Based on Separated Contrastive Learning

SHANG Ying, HAN Chao, WU Kewei

  1. School of Elementary Education, Fuyang Early Childhood Teacher’s College, Fuyang, Anhui 236015, China
  2. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
  • Online: 2023-11-15 Published: 2023-11-15

Abstract: Personalized speech synthesis uses a reference voice from a target speaker to synthesize speech in that speaker's style. The reference voice depends on both the speaker's style and the text content it conveys. Existing methods treat the reference voice as a whole during contrastive learning and do not contrast speaker style and linguistic content separately. As a result, the synthesized speech is disturbed by the linguistic content and drifts away from the target speaker's style. This paper designs a style-content separated contrastive loss and uses it to build a personalized speech synthesis model, which consists of a style-content separated contrastive module, a speaker module, and a speech decoder module. The style-content separated contrastive module takes the style and the content of the query reference voice as the positive set, and builds the negative set by separating style from content. The separated negative set pushes the query style apart from the content of the other reference voices, and pushes the query content apart from the style of the other reference voices. This module thereby learns a speech feature that accounts for both style and content. The speaker module learns the speaker identity feature, which guides speaker style learning. The speech decoder module fuses the style-content speech feature with the speaker identity feature, which improves how well the model describes the duration, pitch, and energy that characterize speaker style. Experiments on two datasets, VCTK and LibriTTS, show that the proposed method improves the speaker similarity of the synthesized speech, and that the quality of the synthesized speech exceeds that of existing state-of-the-art methods.

Key words: speech synthesis, separated contrastive learning, speaker style
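
The abstract describes the style-content separated contrastive loss only in words, so the following is a minimal sketch of one way such a loss could be written, not the authors' formulation. It assumes an InfoNCE-style objective, cosine similarity, a temperature of 0.1, and one non-zero style vector and one non-zero content vector per reference voice; the function names `cosine` and `separated_contrastive_loss` are hypothetical. For each query, the positive pairs its own style with its own content, and the negatives pair the query style with other voices' content and the query content with other voices' style.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def separated_contrastive_loss(styles, contents, temperature=0.1):
    """InfoNCE-style loss with style-content separated negatives.

    Assumption (not from the paper): styles[i] and contents[i] are the
    style and content embeddings of the i-th reference voice in a batch.
    """
    n = len(styles)
    total = 0.0
    for i in range(n):
        # Positive: style and content taken from the same (query) voice.
        pos = math.exp(cosine(styles[i], contents[i]) / temperature)
        neg = 0.0
        for j in range(n):
            if j == i:
                continue
            # Negative 1: query style paired with another voice's content.
            neg += math.exp(cosine(styles[i], contents[j]) / temperature)
            # Negative 2: another voice's style paired with query content.
            neg += math.exp(cosine(styles[j], contents[i]) / temperature)
        total += -math.log(pos / (pos + neg))
    return total / n


# Toy batch of two reference voices with 2-D embeddings.
styles = [[1.0, 0.0], [0.0, 1.0]]
aligned = separated_contrastive_loss(styles, [[1.0, 0.0], [0.0, 1.0]])
swapped = separated_contrastive_loss(styles, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In this toy batch the loss is lower when each style matches its own content than when the contents are swapped between the two voices, which is the behaviour the separated negatives are meant to encourage. A real model would compute the embeddings with learned encoders and use a tensor library rather than Python lists.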