DTZH1505: Large Scale Open Source Mandarin Speech Corpus

doi:10.3778/j.issn.1002-8331.2112-0333

Abstract

Abstract: In recent years, deep learning has achieved breakthroughs in speech recognition and has driven the wide adoption of speech recognition technology in daily life. Further optimization of speech recognition models requires larger-scale annotated data. However, existing open-source speech datasets are still too small, and their corpora consist mostly of long, news-style texts that lean toward written language. Targeting popular speech recognition applications such as human-computer interaction and intelligent customer service, this paper collects read speech through crowdsourcing and builds and open-sources DTZH1505, the largest Chinese Mandarin speech corpus to date. The dataset records the natural speech of 6,408 speakers from 8 major Chinese dialect regions and 33 provinces, totaling 1,505 hours, and its content covers social chat, human-computer interaction, intelligent customer service, and in-vehicle commands. It can be widely used in research on corpus linguistics, conversation analysis, speech recognition, and speaker recognition. A series of benchmark speech recognition experiments shows that, compared with the similarly sized Chinese speech corpus AISHELL-2, speech recognition models trained on this dataset perform better.

Keywords: Mandarin speech corpus, open-source data, speech recognition, deep learning, phoneme balance

摘要： 近年来，深度学习在语音识别领域取得了突破性进展，并推动语音识别技术广泛应用到人们的日常生活中。语音识别模型的进一步优化需要更大规模标定数据的驱动，然而，目前开源的语音数据集规模仍太小，语料多为偏向书面用语的新闻类长文本。针对人机交互、智能客服等热门语音识别应用，通过众包模式采集朗读式语音，构建并开源了迄今为止最大规模的中文普通话语音数据集DTZH1505。数据集记录了6 408位来自中国八大方言地域、33个省份的说话人的自然语音，时长达1 505 h，语料内容涵盖社交聊天、人机交互、智能客服以及车载命令等，可广泛用于语料库语言学、会话分析、语音识别、说话人识别等研究。开展一系列基准语音识别实验，实验结果表明：相较于同规模中文语音数据集AISHELL-2，基于此数据集训练的语音识别模型效果更好。

关键词: 中文普通话语音库, 开源数据, 语音识别, 深度学习, 音素平衡

WANG Dong, WANG Liyuan, WANG Daliang, QI Hongwei. DTZH1505: Large Scale Open Source Mandarin Speech Corpus[J]. Computer Engineering and Applications, 2022, 58(11): 295-301.

王东, 王丽媛, 王大亮, 齐红威. DTZH1505：大规模开源中文普通话语音库[J]. 计算机工程与应用, 2022, 58(11): 295-301.
