Computer Engineering and Applications, 2023, Vol. 59, Issue (15): 1-16. DOI: 10.3778/j.issn.1002-8331.2211-0322

• Research Hotspots and Reviews •

Review of Speech Synthesis Methods Under Low-Resource Condition

ZHANG Jialin, Mairidan Wushouer, Gulanbaier Tuerhong   

  1. School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
  • Online: 2023-08-01 Published: 2023-08-01

低资源条件下的语音合成方法综述

张佳琳,买日旦·吾守尔,古兰拜尔·吐尔洪   

  1. 新疆大学 信息科学与工程学院,乌鲁木齐 830046

Abstract: Speech synthesis is an active research area in human-computer interaction. With the rise of deep learning, the research focus has shifted from inefficient traditional methods to end-to-end speech synthesis based on neural networks. However, building a mature speech synthesis system remains difficult under low-resource conditions, where minority-language corpora, speech data from a target speaker, or large emotional speech datasets are hard to collect. This paper therefore introduces the classic speech synthesis models by category and systematically reviews domestic and international research on the low-resource problem. From the perspectives of system architecture and model training, it describes the mainstream techniques used in recent years to improve the overall performance of speech synthesis models. It also summarizes open-source speech datasets suited to different speech synthesis tasks, covering multiple languages, emotions, and speakers. It then surveys low-resource speech synthesis methods based on deep learning and machine learning, such as transfer learning, meta-learning, and data augmentation, and compares their advantages and disadvantages. Speaker adaptation, voice cloning, and voice conversion in few-shot scenarios are briefly introduced. Finally, feasible research directions for alleviating the low-resource speech synthesis problem are discussed.

Key words: speech synthesis, low resource, data augmentation, transfer learning, meta learning, fine-tuning
