Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (4): 161-168. DOI: 10.3778/j.issn.1002-8331.1912-0118

• Pattern Recognition and Artificial Intelligence •

Bi-directional Uyghur-Chinese Neural Machine Translation with Marked Syllables

Hasan Wumaier, Sirajahmat Ruzmamat, Xireaili Hairela, LIU Wenqi, Tuergen Yibulayin, WANG Liejun, Wayit Abulizi   

  1. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
  2. Xinjiang Laboratory of Multi-language Information Technology, Xinjiang University, Urumqi 830046, China
  3. School of Software, Xinjiang University, Urumqi 830091, China
  • Online: 2021-02-15 Published: 2021-02-06

Abstract:

In recent years, neural machine translation has become the mainstream approach to machine translation, but low-resource translation still suffers from scarce parallel corpora and data sparseness. To address the data sparseness caused by the limited Uyghur-Chinese parallel corpus and the complex morphology of Uyghur, this paper exploits the syllable structure of Uyghur: words are segmented into syllables, the BME (Begin, Middle, End) marking scheme is incorporated, and a neural machine translation method based on the resulting marked syllables is proposed. Compared with neural machine translation methods that use word-level and BPE-level granularity, the proposed method improves BLEU by 7.39 and 3.04 points, respectively, on the Uyghur-Chinese task, and by 5.82 and 3.09 points, respectively, on the Chinese-Uyghur task. These results indicate that, when parallel data are insufficient, the method effectively improves the quality of Uyghur-Chinese machine translation.

Key words: neural machine translation, sparse data, syllable level, Uyghur-Chinese neural machine translation
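
The marking step described in the abstract can be illustrated with a minimal sketch. It assumes the words have already been split into syllables (the syllabification rules themselves are not given here), and it uses a hypothetical `@B` / `@M` / `@E` suffix format; the function names, the marker format, and the handling of single-syllable words are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List


def mark_syllables(syllables: List[str]) -> List[str]:
    """Attach Begin/Middle/End position markers to the syllables of one word.

    Assumption: markers are written as '@B', '@M', '@E' suffixes.
    Assumption: a single-syllable word is left unmarked.
    """
    n = len(syllables)
    if n == 0:
        return []
    if n == 1:
        return list(syllables)
    marked = [syllables[0] + "@B"]
    marked += [s + "@M" for s in syllables[1:-1]]
    marked.append(syllables[-1] + "@E")
    return marked


def restore_words(tokens: List[str]) -> List[str]:
    """Rebuild words from a sequence of marked syllables (inverse of marking)."""
    words: List[str] = []
    current: List[str] = []
    for tok in tokens:
        if tok.endswith("@B"):
            current = [tok[:-2]]          # start a new word
        elif tok.endswith("@M"):
            current.append(tok[:-2])      # continue the current word
        elif tok.endswith("@E"):
            current.append(tok[:-2])      # close the current word
            words.append("".join(current))
            current = []
        else:
            words.append(tok)             # unmarked token: a whole word
    if current:
        # Flush an unterminated word, e.g. from imperfect model output.
        words.append("".join(current))
    return words


# Hand-split, Latin-script syllables used purely as illustrative input.
sentence = [["mek", "tep", "ler"], ["men"]]
tokens = [t for word in sentence for t in mark_syllables(word)]
print(tokens)
print(restore_words(tokens))
```

Because the markers record where each word begins and ends, the syllable sequence produced by a translation model can be joined back into words without a separate word-boundary token, which is the practical reason for adding position markers to a sub-word vocabulary.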