Computer Engineering and Applications, 2012, Vol. 48, Issue (14): 139-142.

• Database, Signal and Information Processing •

Method of Chinese word rough segmentation by maximum match and ambiguity detection algorithms

LI Guohe1,2,3, LIU Guangsheng1,2,3, QIN Bobo1,2,3, WU Weijiang1,2,3, LI Hongqi1,2,3   

  1. College of Geophysics and Information Engineering, China University of Petroleum, Beijing 102249, China
  2. The State Key Lab of Petroleum Resource and Prospecting, China University of Petroleum, Beijing 102249, China
  3. PanPass Institute of Digital Identification Management and Internet of Things, Beijing 100029, China
  • Online: 2012-05-11 Published: 2012-05-14

Abstract: Word segmentation is an important preprocessing step in Chinese text information processing. To address the low accuracy of current Chinese word segmentation and the large size of rough segmentation result sets, a method, CWRS, is proposed. It builds on the maximum match algorithm and adds combination ambiguity detection and cross ambiguity detection during text segmentation, together with the omni-segmentation algorithm. The method improves the accuracy of rough segmentation and reduces the size of the rough segmentation result set, which lays the foundation for precise segmentation of words in Chinese text. Comparative experiments against other algorithms on the same public corpus data set show good results.

Key words: Chinese word segmentation, rough segmentation, maximum match algorithm, omni-segmentation algorithm, ambiguity detection
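
The abstract names its building blocks (maximum matching, cross ambiguity detection, combination ambiguity detection) but gives no implementation details. The following is a minimal sketch of those textbook techniques and is not the CWRS algorithm itself. It assumes a small hypothetical lexicon, and it flags cross ambiguity with a common heuristic: forward and backward maximum matching disagree. All function names, the `max_len` parameter, and the sample lexicon are illustrative assumptions.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: longest lexicon word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left segmentation: longest lexicon word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.insert(0, piece)
                j -= size
                break
    return words


def has_cross_ambiguity(text, lexicon, max_len=4):
    """Heuristic (an assumption, not the paper's test): the two scans disagree."""
    return (forward_max_match(text, lexicon, max_len)
            != backward_max_match(text, lexicon, max_len))


def has_combination_ambiguity(word, lexicon):
    """True when a lexicon word can also be split into two lexicon words."""
    return any(word[:k] in lexicon and word[k:] in lexicon
               for k in range(1, len(word)))


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"研究", "研究生", "生命", "起源", "才能", "才", "能"}

print(forward_max_match("研究生命起源", LEXICON))
print(backward_max_match("研究生命起源", LEXICON))
print(has_cross_ambiguity("研究生命起源", LEXICON))
print(has_combination_ambiguity("才能", LEXICON))
```

In a rough segmentation stage of this kind, only the spans flagged as ambiguous would need to be expanded into several candidate segmentations, which is one plausible way to keep the candidate set small. How CWRS actually combines these steps is described in the full paper, not in the abstract.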