Computer Engineering and Applications ›› 2015, Vol. 51 ›› Issue (5): 116-120.


Research on joint Chinese-Japanese word segmentation for phrase-based statistical machine translation

WU Peihao, XU Jin’an, ZHANG Yujie   

  1. Beijing Jiaotong University, Beijing 100044, China
  • Online: 2015-03-01 Published: 2015-04-08

面向短语统计机器翻译的汉日联合分词研究

吴培昊,徐金安,张玉洁   

  1. 北京交通大学,北京 100044

Abstract: Unknown words and word segmentation granularity are two main problems in Chinese-Japanese machine translation. Word segmentation is an essential first step in Chinese and Japanese natural language processing. Because Chinese and Japanese segmentation follow different part-of-speech tagging systems, and the two languages differ in grammar and semantic expression, the granularity of the segmentation results needs to be readjusted to improve the performance of statistical machine translation (SMT). This paper proposes an approach that adjusts word segmentation granularity for SMT by combining a Hanzi-Kanji comparison table with a Japanese-Chinese dictionary. Experimental results show that the proposed method effectively aligns the segmentation granularity of the Chinese and Japanese sides and improves SMT performance. The paper also analyses the experimental results and discusses the effect of joint Chinese-Japanese word segmentation granularity on phrase-based SMT.

Key words: segmentation granularity, Hanzi-Kanji comparison table, Chinese-Japanese machine translation (MT)

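Abstract (Chinese version): Unknown words and word segmentation granularity are the two main problems in Chinese-Japanese and Japanese-Chinese machine translation research. Unlike English and other Western languages, Chinese and Japanese do not separate words with spaces, so word segmentation is an important task in processing both languages. Because of differences in part-of-speech tagging systems, grammar, and semantic expression, the granularity of the segmentation results needs further adjustment to improve the performance of statistical machine translation systems. This paper proposes a method, aimed at statistical machine translation, for adjusting the segmentation granularity of Chinese and Japanese based on a Hanzi-Kanji comparison table and Japanese-Chinese dictionary information. Experimental results show that the method effectively adjusts the segmentation granularity on both the source-language and target-language sides and improves the performance of the statistical machine translation system. By comparing experimental results, the paper analyses and discusses the effect of segmentation granularity on the performance of Chinese-Japanese statistical translation systems.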

Key words (Chinese version): segmentation granularity, Hanzi-Kanji comparison table, Chinese-Japanese machine translation
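
The abstract does not spell out the adjustment procedure, so the following is only a minimal sketch of one way a Hanzi-Kanji comparison table and a Japanese-Chinese dictionary could be used to bring the two segmentations to a common granularity. It should not be read as the authors' algorithm. The function names, the greedy longest-match merging strategy, the `max_span` limit, and the toy table and dictionary entries are all assumptions made for illustration.

```python
# Illustrative sketch only: names, strategy, and data below are hypothetical
# and are not taken from the paper.

# Toy Hanzi -> Kanji comparison table (simplified Chinese to Japanese forms).
HANZI_TO_KANJI = {"经": "経", "济": "済"}

# Toy Japanese -> Chinese dictionary for words that do not share characters.
JA_ZH_DICT = {"機械翻訳": "机器翻译"}


def to_kanji(text):
    """Map each Hanzi to its Kanji counterpart; unknown characters pass through."""
    return "".join(HANZI_TO_KANJI.get(ch, ch) for ch in text)


def merge_by_counterpart(tokens, counterpart_tokens, normalize, max_span=3):
    """Greedily merge adjacent tokens whose concatenation, after `normalize`,
    appears as a single token on the other language side."""
    counterpart = set(counterpart_tokens)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to a span of two tokens.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if normalize(candidate) in counterpart:
                merged.append(candidate)
                i += span
                break
        else:
            # No merge applies: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


def adjust_granularity(zh_tokens, ja_tokens):
    """Coarsen each side wherever the other side treats the span as one word."""
    zh_adjusted = merge_by_counterpart(zh_tokens, ja_tokens, to_kanji)
    ja_adjusted = merge_by_counterpart(ja_tokens, zh_tokens, JA_ZH_DICT.get)
    return zh_adjusted, ja_adjusted


zh, ja = adjust_granularity(
    ["经济", "学", "和", "机器翻译"],
    ["経済学", "と", "機械", "翻訳"],
)
print(zh)
print(ja)
```

In this toy pair the Chinese side splits 经济学 more finely than the Japanese side, while the Japanese side splits 機械翻訳 more finely than the Chinese side. The sketch merges each over-segmented span so that both sentences end up with three tokens, which is the kind of granularity agreement that word alignment in a phrase-based SMT pipeline is expected to benefit from.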