Computer Engineering and Applications ›› 2018, Vol. 54 ›› Issue (24): 20-25. DOI: 10.3778/j.issn.1002-8331.1810-0354

Hybrid unit selection speech synthesis system target cost construction

CAI Wenbin1, WEI Yunlong1, XU Haihua2, PAN Lin1   

  1. College of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
  2. Temasek Laboratory, Nanyang Technological University, Singapore 639798, Singapore
  • Online: 2018-12-15 Published: 2018-12-14


蔡文彬1,魏云龙1,徐海华2,潘  林1   

  1. 福州大学 物理与信息工程学院,福州 350108
  2. 南洋理工大学 Temasek实验室,新加坡 639798

Abstract: Concatenative unit selection is generally guided by minimizing the sum of the target cost and the concatenation cost. Because candidate units involve complex linguistic and acoustic properties, the key to improving synthesized speech quality is to choose acoustic (or linguistic) features that accurately represent unit characteristics and to construct the corresponding target cost. This paper explores target cost construction from two aspects: acoustic features (Mel-generalized cepstrum, log fundamental frequency, bottleneck features) and acoustic models. Experimental results show that a Deep Neural Network (DNN) based acoustic model, trained on a large similar corpus and then fine-tuned on the current corpus, predicts more robust bottleneck features. Using those features in the target cost calculation guides the selection of optimal candidate units and improves the quality of the synthesized speech.
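
As an illustration of the selection criterion described in the abstract, the following is a minimal sketch of a dynamic-programming (Viterbi-style) search that picks one candidate unit per position so that the summed target and concatenation costs are minimal. It is not the paper's implementation. The function names `euclidean` and `select_units`, the weights `w_target` and `w_concat`, and the use of plain Euclidean distance for both costs are assumptions made for illustration. In a real system the target features would come from an acoustic model (for example, predicted bottleneck features) and the concatenation cost would typically compare features at unit boundaries.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_units(targets, candidates, w_target=1.0, w_concat=1.0):
    """Pick one candidate per position minimizing target + concatenation cost.

    targets:    list of predicted feature vectors, one per position.
    candidates: list (one entry per position) of lists of candidate vectors.
    Returns (path, total_cost), where path[i] indexes candidates[i].

    Assumption: both costs are Euclidean distances on the same vectors.
    """
    if not targets or len(targets) != len(candidates):
        raise ValueError("targets and candidates must be non-empty and aligned")

    # best[i][j]: minimal cumulative cost of any path ending at candidate j
    # of position i; back[i][j]: index of its predecessor at position i - 1.
    best = [[w_target * euclidean(targets[0], c) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]

    for i in range(1, len(targets)):
        row, ptr = [], []
        for cand in candidates[i]:
            target_cost = w_target * euclidean(targets[i], cand)
            # Cost of reaching this candidate from each previous candidate.
            joins = [
                best[i - 1][k] + w_concat * euclidean(prev, cand)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_min = min(range(len(joins)), key=joins.__getitem__)
            row.append(joins[k_min] + target_cost)
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return path, total


if __name__ == "__main__":
    # Toy one-dimensional features, purely illustrative.
    toy_targets = [[0.0], [1.0], [2.0]]
    toy_candidates = [[[0.0], [5.0]], [[1.0], [4.0]], [[9.0], [2.0]]]
    print(select_units(toy_targets, toy_candidates))
```

The search runs in time proportional to the number of positions times the square of the number of candidates per position, which is why practical systems usually prune the candidate list before this step.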

Key words: speech synthesis, target cost, acoustic features, acoustic models, concatenative unit

摘要: 合成语音的基元是通过最小化目标代价和拼接代价来选取。由于拼接基元涉及复杂的语言学、声学特性,如何选择能准确描述基元信息的声学特征(或语言学特征)并构建相应目标代价是提高合成语音质量的关键。从声学特征和声学模型两个方面对目标代价构建进行了探究。实验结果表明,经过相似语料训练后微调的深度声学网络模型,预测的瓶颈特征更能表征拼接基元特性,从而指导目标代价筛选理想候选单元,提高合成语音的质量。

关键词: 语音合成, 目标代价, 声学特征, 声学模型, 拼接基元