Computer Engineering and Applications (计算机工程与应用) ›› 2008, Vol. 44 ›› Issue (12): 175-177.

• Database, Signal and Information Processing •

Rapid word segmentation algorithm based on local ambiguity word grid

ZHANG Guo-bing1,2, LI Miao1

  1. Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei 230031, China
  2. University of Science and Technology of China, Hefei 230026, China
  • Received: 2007-08-07 Revised: 2007-11-13 Online: 2008-04-21 Published: 2008-04-21
  • Contact: ZHANG Guo-bing

Abstract: This paper introduces the concept of a local ambiguity word grid. To handle overlay ambiguity in Chinese word segmentation, it proposes an iterative algorithm for training an overlay ambiguity dictionary, which yields a dictionary of candidate overlay-ambiguity entries. On this basis, it presents a segmentation algorithm based on local ambiguity word grids that can detect the compounding ambiguity and overlay ambiguity arising during Chinese word segmentation. The algorithm considers only the local word grids in which ambiguity actually occurs, and it reduces the handling of overlay ambiguity to a lookup in the candidate dictionary, so its time complexity drops substantially. Experimental results show that the algorithm segments Chinese text quickly and achieves a segmentation accuracy above 97%.

Key words: Chinese word segmentation, overlay ambiguity, overlapping ambiguity, local ambiguity word grid
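
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of restricting work to local ambiguous regions, not the authors' algorithm. It assumes a toy dictionary, a hypothetical candidate dictionary stored as a plain set, and hypothetical function names (`word_edges`, `local_ambiguity_spans`, `covering_hits`). Multi-character dictionary words that overlap or nest are merged into maximal local spans, and a separate lookup flags words listed in the candidate dictionary. The iterative training of that dictionary is not shown.

```python
def word_edges(sentence, dictionary, max_len=4):
    """Return (start, end) for every multi-character dictionary word in the sentence."""
    n = len(sentence)
    edges = []
    for i in range(n):
        for j in range(i + 2, min(n, i + max_len) + 1):
            if sentence[i:j] in dictionary:
                edges.append((i, j))
    return edges


def local_ambiguity_spans(edges):
    """Merge overlapping or nested words into maximal spans.

    Only spans built from at least two words are kept; a word that
    overlaps nothing else is treated as unambiguous in this sketch.
    """
    merged = []  # each item: [start, end, number_of_words]
    for start, end in sorted(edges):
        if merged and start < merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
            merged[-1][2] += 1
        else:
            merged.append([start, end, 1])
    return [(s, e) for s, e, count in merged if count > 1]


def covering_hits(sentence, edges, candidates):
    """Return the words that appear in the (hypothetical) candidate dictionary."""
    return [sentence[s:e] for s, e in edges if sentence[s:e] in candidates]


# Toy data, chosen for illustration only.
DICTIONARY = {"才能", "结合", "合成", "成分", "分子"}
CANDIDATES = {"才能"}  # entries that may need to be split ("才" + "能")

sentence = "他才能结合成分子"
edges = word_edges(sentence, DICTIONARY)
spans = local_ambiguity_spans(edges)
print([sentence[s:e] for s, e in spans])
print(covering_hits(sentence, edges, CANDIDATES))
```

In this sketch, only the span "结合成分子" would be handed to a disambiguation step, while the rest of the sentence is segmented directly. That mirrors the abstract's claim that ambiguity handling is confined to local regions, with one class of ambiguity reduced to a dictionary lookup.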