Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (13): 153-157.

• Database, Data Mining, Machine Learning •

Building and profiling Chinese-English 3-tuple comparable corpora

HU Xiaopeng, YUAN Qi, GENG Xinhui, ZHU Shu

  1. China Center for Information Industry Development (CCID), Beijing 100044, China
  • Online: 2014-07-01 Published: 2015-05-12

Abstract: Chinese-English parallel corpora carry an inherently skewed language model because of translationese. Natural language processing systems trained on such corpora, including machine translation and cross-language information retrieval, inherit this skewed model, which seriously degrades application performance. To overcome this inherent defect of parallel corpora, this paper proposes building and profiling Chinese-English 3-tuple comparable corpora. The study uses comparable corpora and automatic language profiling, combining statistical and rule-based methods, to analyze the native English and Chinglish components of 3-tuple comparable corpora whose three elements are native English, Chinglish, and standard Chinese. On this basis, automatic extraction techniques such as n-grams and keyword clusters are used to mine bilingual resources grounded in a native-language model, with the aim of improving and advancing natural language processing applications such as machine translation.

Key words: 3-tuple comparable corpora, language transfer, automatic language profiling, n-grams
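
The abstract names n-gram extraction and a statistical comparison of native English against Chinglish but does not say which statistic is used. The following is a minimal sketch of one common way to do this kind of comparison, assuming a log-likelihood (G2) keyness score; the function names (`ngrams`, `ngram_counts`, `log_likelihood`, `key_ngrams`) and the toy sentences are hypothetical illustrations, not the authors' implementation or data.

```python
import math
import re
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams (as tuples) of a lowercased text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(sentences, n):
    """Count n-grams sentence by sentence so none spans a boundary."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence, n))
    return counts


def log_likelihood(a, b, total_a, total_b):
    """G2 score for an item seen a times in corpus A and b times in corpus B."""
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2.0 * g2


def key_ngrams(study, reference, n=2, top=5):
    """Rank n-grams that are over-represented in `study` versus `reference`."""
    c_study, c_ref = ngram_counts(study, n), ngram_counts(reference, n)
    t_study, t_ref = sum(c_study.values()), sum(c_ref.values())
    if t_study == 0 or t_ref == 0:
        return []
    scored = []
    for gram, a in c_study.items():
        b = c_ref.get(gram, 0)
        # Keep only n-grams relatively more frequent in the study corpus.
        if a / t_study > b / t_ref:
            score = log_likelihood(a, b, t_study, t_ref)
            scored.append((score, " ".join(gram)))
    return sorted(scored, reverse=True)[:top]


# Toy, made-up sentences standing in for two elements of the 3-tuple.
chinglish = [
    "we should pay attention to improve the level",
    "pay attention to raise the level of service",
]
native = [
    "we need to focus on improving quality",
    "we should focus on raising service standards",
]

for score, gram in key_ngrams(chinglish, native, n=2, top=3):
    print(f"{gram}\t{score:.2f}")
```

Swapping the two arguments ranks the n-grams characteristic of the native English side instead. A real study would need proper tokenization, a far larger corpus, and a significance threshold on the score.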