Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (19): 146-153.

• Pattern Recognition and Artificial Intelligence •

Label generation for Chinese statistical parametric speech synthesis

HAO Dongliang (郝东亮), YANG Hongwu (杨鸿武), ZHANG Ce (张策), ZHANG Shuai (张帅), GUO Lizhao (郭立钊), YANG Jingbo (杨静波)

  1. College of Physics and Electronic Engineering, Northwest Normal University, Lanzhou 730070, China
  • Online: 2016-10-01 Published: 2016-11-18

Abstract: To generate context-dependent labels for Chinese statistical parametric speech synthesis, this paper designs a six-level context-dependent label format covering the initial/final level, syllable level, word level, prosodic word level, prosodic phrase level and sentence level. The input Chinese sentence is first normalized, and grammatical analysis is applied to obtain its sentence structure and word segmentation. Grapheme-to-phoneme conversion then yields the initial, final and tone of each Chinese character. Finally, the Transformation-Based error-driven Learning (TBL) algorithm predicts the prosodic word boundaries and prosodic phrase boundaries of the input text. From these text analysis and prosody prediction steps, the initial and final of each character and their contextual structure are obtained, which produces the context-dependent labels required for statistical parametric speech synthesis. A Hidden Markov Model (HMM) based Mandarin statistical parametric speech synthesis system, with initials and finals as synthesis units, is built, and subjective and objective experiments evaluate how different label information affects the quality of the synthesized speech. The results show that the richer the context-dependent label information, the better the quality of the synthesized speech.

Key words: text analysis, speech synthesis, context-dependent label, prosodic prediction, grapheme-to-phoneme conversion
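
The label generation step described in the abstract can be illustrated with a minimal Python sketch. It assumes the text analysis front end has already produced tone-numbered pinyin grouped by word, prosodic word and prosodic phrase. The names `split_syllable` and `generate_labels`, the initial inventory (including the treatment of `y` and `w` as initials) and the `/A:` to `/F:` field layout are all hypothetical and chosen for illustration only; the paper's actual label format is not given on this page.

```python
# Hypothetical initial inventory; two-letter initials come first so that
# "zh" is matched before "z". Treating "y" and "w" as initials is an assumption.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def split_syllable(pinyin):
    """Split tone-numbered pinyin such as 'zhong1' into (initial, final, tone)."""
    tone = int(pinyin[-1])
    body = pinyin[:-1]
    for ini in INITIALS:
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return "", body, tone  # zero-initial syllable such as 'an4'


def generate_labels(sentence):
    """Build one illustrative context label per initial/final unit.

    `sentence` is a list of prosodic phrases; each phrase is a list of
    prosodic words; each prosodic word is a list of lexical words; each
    word is a list of tone-numbered pinyin syllables.
    """
    n_syl = sum(len(w) for pp in sentence for pw in pp for w in pw)
    units = []
    for pi, phrase in enumerate(sentence, 1):
        for wi, pword in enumerate(phrase, 1):
            for li, word in enumerate(pword, 1):
                for si, syl in enumerate(word, 1):
                    ini, fin, tone = split_syllable(syl)
                    for phone in (ini, fin):
                        if phone:
                            units.append((phone, tone, si, len(word), li,
                                          len(pword), wi, len(phrase),
                                          pi, len(sentence)))
    # Pad with silence so the first and last units have neighbours.
    phones = ["sil"] + [u[0] for u in units] + ["sil"]
    labels = []
    for i, (phone, tone, si, nw, li, npw, wi, npp, pi, ns) in enumerate(units):
        labels.append(
            f"{phones[i]}-{phone}+{phones[i + 2]}"  # initial/final level
            f"/A:{tone}/B:{si}_{nw}"                # syllable level
            f"/C:{li}_{npw}"                        # word level
            f"/D:{wi}_{npp}"                        # prosodic word level
            f"/E:{pi}_{ns}"                         # prosodic phrase level
            f"/F:{n_syl}"                           # sentence level
        )
    return labels


# One prosodic phrase holding one prosodic word, made of the word "zhong1 guo2".
labels = generate_labels([[[["zhong1", "guo2"]]]])
for label in labels:
    print(label)
```

Each printed line describes one initial or final together with its position at the five higher levels. A real system would add many more fields at each level and obtain the prosodic boundaries from a trained predictor rather than from hand-written nesting.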