Method of new word identification based on large-scale corpus

Computer Engineering and Applications, 2007, Vol. 43, Issue 21: 157-159.


HE Min^1,2, GONG Cai-chun^1,2, ZHANG Hua-ping^1, CHENG Xue-qi^1

1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
2. Graduate University of Chinese Academy of Sciences, Beijing 100080, China

Online: 2007-07-21 Published: 2007-07-21
Contact: HE Min

Abstract

This paper proposes a method for identifying new words in a large-scale corpus. Building on repeated-string statistics, the method analyzes both the external context and the internal composition of each candidate string. It first finds all repeated strings in the text collection, then filters the candidates in turn by three linguistic features: the variety of adjacent context characters (context variety), the probability that the first and last characters appear in those positions within a word (inside word probability), and the coupling degree of two-character pairs (double character coupling). The strings that survive all filters are taken as new words. Experiments on corpora of different sizes show that the method is feasible and effective, and that it can be applied to areas such as dictionary compilation and term extraction.

Key words: new words, context variety, inside word probability, double character coupling
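
The first two steps named in the abstract, repeated-string statistics and the context-variety filter, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the candidate length limit `max_len`, and the thresholds `min_count` and `min_variety` do not come from the paper, and the inside word probability and double character coupling filters are left out because the abstract does not define them.

```python
from collections import Counter


def repeated_strings(text, max_len=4, min_count=2):
    """Count substrings of 2..max_len characters; keep those seen at least min_count times.

    max_len and min_count are illustrative defaults, not values from the paper.
    """
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


def context_variety(text, candidate):
    """Return (left, right): the number of distinct characters adjacent to candidate."""
    left, right = set(), set()
    start = text.find(candidate)
    while start != -1:
        end = start + len(candidate)
        if start > 0:
            left.add(text[start - 1])
        if end < len(text):
            right.add(text[end])
        # Step by one character so overlapping occurrences are also visited.
        start = text.find(candidate, start + 1)
    return len(left), len(right)


def filter_by_context(text, candidates, min_variety=2):
    """Keep candidates whose left and right context varieties both reach min_variety.

    min_variety is a hypothetical threshold chosen for this sketch.
    """
    kept = []
    for s in candidates:
        left, right = context_variety(text, s)
        if left >= min_variety and right >= min_variety:
            kept.append(s)
    return sorted(kept)
```

On the toy string `"xabcyzabcwqabcp"`, the repeated strings are `ab`, `bc`, and `abc`. Only `abc` passes the filter, because `ab` is always followed by `c` and `bc` is always preceded by `a`, which suggests that both are fragments of a longer unit rather than words in their own right.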

HE Min, GONG Cai-chun, ZHANG Hua-ping, CHENG Xue-qi. Method of new word identification based on large-scale corpus[J]. Computer Engineering and Applications, 2007, 43(21): 157-159.
