Automatic extraction and alignment of multiword expressions from English-Chinese comparable corpus

doi:10.3778/j.issn.1002-8331.2010.31.037

Computer Engineering and Applications, 2010, Vol. 46, Issue 31: 130-134.

• Database, Signal and Information Processing •

XIAO Jian, XU Jian, XU Xiao-lan, YUAN Qi

China Center for Information Industry Development, Beijing 100044, China

Received: 2009-12-30 Revised: 2010-03-17 Online: 2010-11-01 Published: 2010-11-01
Contact: XIAO Jian

Abstract: Multiword expressions (MWEs) are important for practical applications such as machine translation (MT), multilingual information retrieval, data mining, and other natural language processing tasks. A method combining semantic templates with statistical tools is proposed for automatically extracting native English MWEs from a three-tuple comparable corpus. Thesaurus-based and distributional methods are used to calculate the semantic relations between words in order to improve MWE coverage. GIZA++ is run to align words at the sentence level in order to obtain Chinese MWE candidates. For each native English MWE, all Chinese MWE candidates are collected and ranked by their co-occurrence affinity, and only the top-ranked candidate is accepted as the Chinese translation of the given English MWE. Experimental results show that the proposed technique effectively improves MWE extraction and alignment.

Key words: three-tuple comparable corpus, multiword expressions (MWE), semantic template
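
The candidate-ranking step described in the abstract (collect all Chinese candidates for an English MWE, rank them by co-occurrence affinity, keep the top one) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which affinity measure is used, so the Dice coefficient is substituted here; the function names `dice` and `best_translations` are hypothetical; and the input is assumed to be (English MWE, Chinese candidate) pairs already derived from a word alignment such as the one GIZA++ produces.

```python
from collections import Counter


def dice(pair_count, en_count, zh_count):
    """Dice coefficient 2*c(e,z) / (c(e) + c(z)); an assumed affinity measure."""
    total = en_count + zh_count
    return 2.0 * pair_count / total if total else 0.0


def best_translations(aligned_pairs):
    """Return {english_mwe: (top_chinese_candidate, score)}.

    aligned_pairs: iterable of (english_mwe, chinese_candidate) tuples,
    one tuple per co-occurrence in an aligned sentence pair.
    Simplification: marginal counts are taken from the pairs themselves,
    not from the full corpus.
    """
    aligned_pairs = list(aligned_pairs)
    pair_counts = Counter(aligned_pairs)
    en_counts = Counter(en for en, _ in aligned_pairs)
    zh_counts = Counter(zh for _, zh in aligned_pairs)

    best = {}
    for (en, zh), count in pair_counts.items():
        score = dice(count, en_counts[en], zh_counts[zh])
        # Keep only the highest-scoring candidate for each English MWE.
        if en not in best or score > best[en][1]:
            best[en] = (zh, score)
    return best


# Toy data, invented purely for illustration.
pairs = [
    ("take into account", "考虑"),
    ("take into account", "考虑"),
    ("take into account", "账户"),
    ("bank account", "账户"),
    ("bank account", "银行账户"),
    ("bank account", "银行账户"),
]
result = best_translations(pairs)
```

On the toy data, the more frequent candidate for each English MWE outranks the spurious one that also aligns with the other expression. Any other association score (for example pointwise mutual information) could be swapped in for `dice` without changing the selection logic.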

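Abstract (Chinese version): Multiword expressions (MWEs) are used not only to improve the quality of current machine translation systems but also in cross-language retrieval, data mining, and other areas of natural language processing. To this end, a method combining semantic templates with statistical tools is proposed for automatically extracting native English MWEs from a three-tuple comparable corpus. Thesaurus-based and distributional methods are used to compute the similarity between words, which broadens MWE coverage. The GIZA++ alignment algorithm is used to extract the corresponding Chinese MWEs; translation probabilities are computed statistically, and the best English-Chinese MWE translation pair is selected according to these probabilities. Experimental results show that the method effectively improves the accuracy of MWE extraction and alignment.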

CLC Number: TP391

XIAO Jian, XU Jian, XU Xiao-lan, YUAN Qi. Automatic extraction and alignment of multiword expressions from English-Chinese comparable corpus[J]. Computer Engineering and Applications, 2010, 46(31): 130-134.
