Computer Engineering and Applications ›› 2008, Vol. 44 ›› Issue (23): 147-150. DOI: 10.3778/j.issn.1002-8331.2008.23.045

• Database, Signal and Information Processing •

Technical term automatic extraction research based on statistics and rules

LIU Bao, ZHANG Gui-ping, CAI Dong-feng

  1. Knowledge Engineering Center, Shenyang Institute of Aeronautical Engineering, Shenyang 110034, China
  • Received: 2007-10-18 Revised: 2008-01-18 Online: 2008-08-11 Published: 2008-08-11
  • Contact: LIU Bao

基于统计和规则相结合的科技术语自动抽取研究

刘 豹,张桂平,蔡东风   

  1. 沈阳航空工业学院 知识工程中心,沈阳 110034
  • 通讯作者: 刘 豹

Abstract: Automatic extraction of technical terms is an important topic in Chinese information processing. It is widely applied in information retrieval and machine translation, particularly in patent machine translation. This paper studies the recognition of technical terms in patents in the context of the patent machine translation task and, based on an analysis of existing methods, proposes a recognition method that combines statistics and rules. The method first uses a Conditional Random Fields (CRF) model to label and recognize terms in the corpus, then applies a rule-based post-processing step to correct wrongly labeled results. Experimental results show that the method is effective for identifying technical terms; in the open test, the F-value reaches 84.4%.
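
The two-stage idea in the abstract (statistical labeling, then rule-based correction, scored by F-value) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not state the tag set, the CRF features, or the actual post-processing rules. The sketch assumes a B/I/O tag per token, treats the tag sequence as if it had already been produced by a CRF labeler, and uses one hypothetical rule (an "I" tag that does not continue a term is rewritten to "B"). The F-value is the usual harmonic mean of precision and recall over exactly matching term spans.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int]  # [start, end) token offsets of one term


def repair_bio(tags: List[str]) -> List[str]:
    """Hypothetical rule: an "I" that follows "O" (or starts the
    sequence) cannot continue a term, so it is rewritten to "B"."""
    fixed: List[str] = []
    prev = "O"
    for tag in tags:
        if tag == "I" and prev == "O":
            tag = "B"
        fixed.append(tag)
        prev = tag
    return fixed


def decode_spans(tags: List[str]) -> Set[Span]:
    """Collect term spans from a B/I/O sequence; a stray "I" with
    no open term is ignored."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    for i, tag in enumerate(tags):
        if tag != "I" and start is not None:
            spans.add((start, i))  # close the currently open term
            start = None
        if tag == "B":
            start = i  # open a new term
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def f_value(predicted: Set[Span], gold: Set[Span]) -> float:
    """F-value = 2PR / (P + R) over exactly matching spans."""
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative tag sequences only; not data from the paper.
labeler_tags = ["O", "I", "I", "O", "B", "I"]  # stand-in for CRF output
gold_tags = ["O", "B", "I", "O", "B", "I"]

gold = decode_spans(gold_tags)
before = f_value(decode_spans(labeler_tags), gold)
after = f_value(decode_spans(repair_bio(labeler_tags)), gold)
print(round(before, 3), round(after, 3))  # prints: 0.667 1.0
```

In this toy input the rule recovers a term whose first token was mislabeled, which raises recall and therefore the F-value; the rules used in the paper itself may differ.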

Key words: Conditional Random Fields (CRF), technical term extraction, term recognition

摘要: 科技术语自动抽取是中文信息处理领域的一个重要研究课题,在信息检索、机器翻译等领域,特别是在专利翻译中有着广泛应用。结合专利翻译任务,主要研究专利中科技术语的识别方法,在分析目前已有方法的基础之上,提出了一种使用条件随机场模型进行标注识别,并结合规则对错误识别结果进行后处理的科技术语识别方法。实验结果表明,提出的统计和规则相结合的识别方法是有效的,开放测试结果F值达到了84.4%。

关键词: 条件随机场, 科技术语抽取, 术语识别