Computer Engineering and Applications, 2022, Vol. 58, Issue (11): 295-301. DOI: 10.3778/j.issn.1002-8331.2112-0333

• Engineering and Applications •

DTZH1505: Large-Scale Open-Source Mandarin Speech Corpus

WANG Dong, WANG Liyuan, WANG Daliang, QI Hongwei   

  1. College of Information Engineering, Xizang Minzu University, Xianyang, Shaanxi 712082, China
  2. Datatang (Beijing) Technology Co., Ltd., Beijing 100192, China
  • Online: 2022-06-01 Published: 2022-06-01




Abstract: In recent years, deep learning has achieved breakthroughs in speech recognition and has driven the wide adoption of speech recognition technology in daily life. Further optimization of speech recognition models requires larger-scale annotated data. However, current open-source speech datasets are still too small, and their corpora consist mostly of long, news-style texts in written language. Targeting popular speech recognition applications such as human-computer interaction and intelligent customer service, this paper collects read speech through crowdsourcing and builds and open-sources DTZH1505, the largest Chinese Mandarin speech corpus to date. The dataset records the natural speech of 6,408 speakers from 8 major Chinese dialect regions and 33 provinces, totaling 1,505 hours and covering scenarios such as social chat, human-computer interaction, intelligent customer service, and in-vehicle commands. It can be widely used in research on corpus linguistics, conversation analysis, speech recognition, and speaker recognition. A series of benchmark speech recognition experiments shows that, compared with AISHELL-2, a Chinese speech corpus of the same scale, speech recognition models trained on this dataset perform better.

Key words: Mandarin speech corpus, open-source data, speech recognition, deep learning, phoneme balance
