Computer Engineering and Applications, 2022, Vol. 58, Issue (9): 195-200. DOI: 10.3778/j.issn.1002-8331.2102-0274

• Pattern Recognition and Artificial Intelligence •

Research on Sentence Length Sensitivity in Neural Network Machine Translation

Alim Samat, Sirajahmat Ruzmamat, Maihefureti, Aishan Wumaier, Wushuer Silamu, Turgun Ebrayim   

  1. Laboratory of Multi-Language Information Technology, College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
  • Online: 2022-05-01 Published: 2022-05-01

Abstract: With the development of deep learning, neural machine translation (NMT) has made considerable progress. It is well known, however, that NMT is sensitive to sentence length. To make full use of large-scale parallel corpora while taking sentence length into account, this paper divides the original parallel corpus into several modules by sentence length, trains a sub-model for each module, and proposes an NMT method based on a sentence-length fusion strategy. After training, translations are obtained by fusing the models according to the sentence-length boundaries and by ranking candidates with a fusion of three features (perplexity, sentence length ratio, and a classifier). Experimental results on three different test sets show an average improvement of about 1.2 BLEU points on the English-Chinese task and of 0.4 to 0.6 BLEU points on the Uyghur-Chinese task, which suggests that the method has some reference value.

Key words: machine translation, extreme sentence length data, perplexity (PPL), ensemble, deep learning
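
The abstract describes two steps that a short example may make more concrete: splitting a parallel corpus into modules by sentence length, and ranking candidate translations by a fusion of perplexity, sentence length ratio, and a classifier score. The following is only a minimal sketch under stated assumptions, not the authors' implementation. The length boundaries, the whitespace tokenization, the equal feature weights, the linear scoring formula, and all names (`BOUNDARIES`, `bucket_of`, `partition`, `Candidate`, `rerank`) are hypothetical, since the abstract specifies none of them.

```python
import math
from bisect import bisect_right
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

# Hypothetical length boundaries in source tokens; the abstract gives no values.
BOUNDARIES: Tuple[int, ...] = (10, 30)


def bucket_of(sentence: str, boundaries: Sequence[int] = BOUNDARIES) -> int:
    """Map a pre-tokenized (whitespace-separated) sentence to a length bucket.

    With boundaries (10, 30): fewer than 10 tokens -> 0,
    10 to 29 tokens -> 1, 30 or more tokens -> 2.
    """
    return bisect_right(boundaries, len(sentence.split()))


def partition(
    pairs: Sequence[Tuple[str, str]], boundaries: Sequence[int] = BOUNDARIES
) -> Dict[int, List[Tuple[str, str]]]:
    """Split (source, target) pairs into one module per source-length bucket."""
    modules: Dict[int, List[Tuple[str, str]]] = {
        i: [] for i in range(len(boundaries) + 1)
    }
    for src, tgt in pairs:
        modules[bucket_of(src, boundaries)].append((src, tgt))
    return modules


@dataclass
class Candidate:
    text: str                # candidate translation, whitespace-tokenized
    perplexity: float        # language-model perplexity, must be > 0; lower is better
    classifier_score: float  # assumed to lie in [0, 1]; higher is better


def length_ratio(src: str, hyp: str) -> float:
    """Symmetric length ratio in [0, 1]; 1.0 means equal token counts."""
    s, h = len(src.split()), len(hyp.split())
    if s == 0 or h == 0:
        return 0.0
    return min(s, h) / max(s, h)


def rerank(
    src: str,
    candidates: Sequence[Candidate],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # assumed, untuned
) -> Candidate:
    """Return the candidate with the highest weighted three-feature score."""
    w_ppl, w_len, w_clf = weights

    def score(c: Candidate) -> float:
        return (
            -w_ppl * math.log(c.perplexity)
            + w_len * length_ratio(src, c.text)
            + w_clf * c.classifier_score
        )

    return max(candidates, key=score)
```

In such a setup, each module returned by `partition` would be used to train one sub-model, and `rerank` would choose among the translations those sub-models produce for the same source sentence. In practice the weights would presumably be tuned on a development set.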
