Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (10): 106-114.DOI: 10.3778/j.issn.1002-8331.1905-0122


Domain Information Sharing Method in Mongolian-Chinese Machine Translation Application

ZHANG Zhen, SU Yila, NIU Xianghua, GAO Fen, ZHAO Yaping, Ren Qing Daoer Ji   

  1. School of Information Engineering, Inner Mongolia University of Technology, Hohhot 010000, China
  • Online: 2020-05-15   Published: 2020-05-13

Abstract:

Mongolian-Chinese translation is a low-resource translation task that faces a scarcity of parallel corpus resources. To alleviate the low translation accuracy caused by scarce parallel data and a limited vocabulary, this paper applies the dynamic pre-training method ELMo (Embeddings from Language Models) and combines it with a Transformer translation architecture that shares domain information across multiple tasks for Mongolian-Chinese translation. Firstly, ELMo (deep contextualized word representations) is used to pre-train the monolingual corpora. Secondly, the FastText word embedding algorithm is used to pre-train the large-scale, context-dependent text in the Mongolian-Chinese parallel corpus. Then, following the principle of sharing parameters across tasks to realize domain information sharing, a one-to-many encoder-decoder model is constructed for Mongolian-Chinese neural machine translation. The experimental results show that, on long input sequences, the proposed method effectively improves translation quality over the Transformer baseline.
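As an illustration of the one-to-many, parameter-sharing idea described above, the sketch below (PyTorch, not the authors' code) builds a single Transformer encoder whose parameters are shared across tasks, with one decoder and output projection per target task; the task names, vocabulary sizes, and hyperparameters are assumptions for demonstration only.

```python
import torch
import torch.nn as nn

class OneToManyTransformer(nn.Module):
    """Minimal one-to-many encoder-decoder: one shared encoder, one decoder per task."""
    def __init__(self, src_vocab, tgt_vocabs, d_model=512, nhead=8,
                 num_layers=6, dim_ff=2048):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        # Shared encoder: its parameters are updated by every task, so domain
        # information learned from one task is available to the others.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, dim_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # One decoder and output projection per target task/domain.
        self.tgt_embeds = nn.ModuleDict()
        self.decoders = nn.ModuleDict()
        self.generators = nn.ModuleDict()
        for task, vocab in tgt_vocabs.items():
            dec_layer = nn.TransformerDecoderLayer(d_model, nhead, dim_ff, batch_first=True)
            self.tgt_embeds[task] = nn.Embedding(vocab, d_model)
            self.decoders[task] = nn.TransformerDecoder(dec_layer, num_layers)
            self.generators[task] = nn.Linear(d_model, vocab)

    def forward(self, src_ids, tgt_ids, task):
        memory = self.encoder(self.src_embed(src_ids))   # shared source representation
        tgt = self.tgt_embeds[task](tgt_ids)
        # Causal mask so each target position attends only to earlier positions.
        t = tgt_ids.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.decoders[task](tgt, memory, tgt_mask=causal)
        return self.generators[task](out)                # per-task vocabulary logits

# Usage with made-up vocabulary sizes and two hypothetical Chinese-side tasks.
model = OneToManyTransformer(src_vocab=32000,
                             tgt_vocabs={"zh_news": 32000, "zh_daily": 32000})
src = torch.randint(0, 32000, (2, 10))    # batch of Mongolian token ids
tgt = torch.randint(0, 32000, (2, 8))     # shifted Chinese token ids
logits = model(src, tgt, task="zh_news")  # shape: (2, 8, 32000)
```

In this sketch the shared encoder is the only component trained by all tasks, which is one common way to realize the parameter-sharing principle the abstract refers to; how the paper combines this with the ELMo and FastText pre-trained representations is not reproduced here.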

Key words: Mongolian-Chinese translation, multi-task learning, Transformer, ELMo, FastText
