Computer Engineering and Applications ›› 2009, Vol. 45 ›› Issue (19): 9-11. DOI: 10.3778/j.issn.1002-8331.2009.19.003

• Doctoral Forum •

Phrase filtering technology oriented to term extraction

ZHOU Lang1,2, FENG Chong2, HUANG He-yan2

  1. College of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, China
  2. Research Center of Computer & Language Information Engineering, CAS, Beijing 100097, China
  • Received: 2009-04-02 Revised: 2009-05-07 Online: 2009-07-01 Published: 2009-07-01
  • Contact: ZHOU Lang

Abstract: Term extraction often encounters phrases or phrase fragments that contain active words. These noise items usually have stable collocation patterns and a high co-occurrence probability in the corpus. Commonly used phrase filtering methods focus on measuring the cohesion between the words inside a phrase, so they discriminate poorly against such noise. This paper proposes a phrase filtering method based on left/right entropy that estimates the activity of the words in a phrase or phrase fragment and filters out the phrases or fragments whose activity is high. Applied to a multi-word term extraction system, the method is shown experimentally to remove these noise items effectively and to improve the system's performance.

Key words: term extraction, phrase filtering, left/right entropy, active factor
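
The left/right entropy idea named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formula for the active factor, so everything specific here is an assumption made for illustration: the `activity` score (the mean of a word's left and right neighbor entropy), the `threshold` parameter, and the function names `neighbor_entropy`, `activity`, and `filter_phrases`. The intuition is that a word whose neighbors vary widely in the corpus is "active", and a candidate phrase containing such a word is more likely to be noise than a term.

```python
import math
from collections import Counter


def neighbor_entropy(tokens, word, side):
    """Shannon entropy (in bits) of the tokens adjacent to `word` on one side.

    `side` is "left" or "right". Occurrences at the corpus boundary that
    have no neighbor on that side are skipped.
    """
    offset = -1 if side == "left" else 1
    counts = Counter()
    for i, tok in enumerate(tokens):
        j = i + offset
        if tok == word and 0 <= j < len(tokens):
            counts[tokens[j]] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def activity(tokens, word):
    """Hypothetical activity score: mean of left and right neighbor entropy.

    This is an assumed stand-in, not the paper's active factor.
    """
    left = neighbor_entropy(tokens, word, "left")
    right = neighbor_entropy(tokens, word, "right")
    return (left + right) / 2


def filter_phrases(tokens, phrases, threshold):
    """Keep only phrases in which no word's activity exceeds `threshold`."""
    return [
        phrase
        for phrase in phrases
        if all(activity(tokens, w) <= threshold for w in phrase)
    ]


# Toy corpus: "a" combines freely with many neighbors, "x y" is a fixed unit.
corpus = "a b a c a d x y x y x y".split()
candidates = [("a", "b"), ("x", "y")]
kept = filter_phrases(corpus, candidates, threshold=1.0)
```

In this toy corpus the candidate containing the freely combining word `a` is dropped, while the fixed unit `("x", "y")` is kept. A real system would need a threshold tuned on annotated data, and would work on segmented text rather than whitespace-split tokens.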