Computer Engineering and Applications ›› 2007, Vol. 43 ›› Issue (21): 157-159.

• Database and Information Processing •

Method of new word identification based on large-scale corpus

HE Min¹,², GONG Cai-chun¹,², ZHANG Hua-ping¹, CHENG Xue-qi¹

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
  2. Graduate University of Chinese Academy of Sciences, Beijing 100080, China

  • Online: 2007-07-21 Published: 2007-07-21
  • Contact: HE Min

Abstract: This paper proposes a method for identifying new words in a large-scale corpus that analyzes both the external context and the internal composition of candidate strings. Repeated strings are first collected from the text collection by frequency statistics. Each candidate is then filtered in turn by three linguistic features: the variety of its adjacent context, the in-word position probabilities of its first and last characters, and the coupling degree of its two-character pairs. The strings that survive all filters are taken as new words. Experiments on corpora of different scales show that the method is feasible and effective, and that it can be applied to tasks such as dictionary compilation and term extraction.

Key words: new words, context variety, inside word probability, double character coupling
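The first two stages named in the abstract, repeated-string statistics and the context-variety filter, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the length bounds, the frequency and variety thresholds, and the use of `<s>` / `</s>` as boundary neighbors are all assumptions made for illustration. The inside-word-probability and double-character-coupling filters are omitted because the abstract does not give their formulas.

```python
from collections import Counter, defaultdict


def repeated_strings(texts, min_len=2, max_len=4, min_freq=2):
    """Count substrings of length min_len..max_len and record their neighbors.

    Returns {string: (frequency, left_variety, right_variety)} for strings
    seen at least min_freq times. All parameters are hypothetical defaults.
    """
    counts = Counter()
    left = defaultdict(set)   # distinct characters seen immediately before
    right = defaultdict(set)  # distinct characters seen immediately after
    for text in texts:
        n = len(text)
        for i in range(n):
            for k in range(min_len, max_len + 1):
                j = i + k
                if j > n:
                    break
                s = text[i:j]
                counts[s] += 1
                # Assumption: a text boundary counts as one neighbor type.
                left[s].add(text[i - 1] if i > 0 else "<s>")
                right[s].add(text[j] if j < n else "</s>")
    return {
        s: (c, len(left[s]), len(right[s]))
        for s, c in counts.items()
        if c >= min_freq
    }


def filter_by_context_variety(stats, min_variety=2):
    """Keep strings whose left and right neighbor sets are both varied enough."""
    return sorted(
        s for s, (_, lv, rv) in stats.items()
        if lv >= min_variety and rv >= min_variety
    )


# Toy corpus, far smaller than the large-scale corpora the paper targets.
texts = ["我喜欢写博客", "他喜欢写博客了", "看博客吧"]
stats = repeated_strings(texts)
candidates = filter_by_context_variety(stats)
print(stats["博客"])   # (frequency, left variety, right variety)
print(candidates)
```

On this toy input, fragments such as 写博 repeat but always follow the same character, so the variety filter drops them, while 博客 appears in varied contexts and is kept. Real words such as 喜欢 are also dropped here only because three sentences give too little context variety.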