Method of Chinese word rough segmentation by maximum match and ambiguity detection algorithms

Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (14): 139-142.

• Database, Signal and Information Processing •

LI Guohe1,2,3, LIU Guangsheng1,2,3, QIN Bobo1,2,3, WU Weijiang1,2,3, LI Hongqi1,2,3

1. College of Geophysics and Information Engineering, China University of Petroleum, Beijing 102249, China
2. The State Key Lab of Petroleum Resource and Prospecting, China University of Petroleum, Beijing 102249, China
3. PanPass Institute of Digital Identification Management and Internet of Things, Beijing 100029, China

Online: 2012-05-11 Published: 2012-05-14

Abstract: Word segmentation is an important preprocessing step in Chinese text information processing. To address the low accuracy of current Chinese word segmentation and the large size of rough segmentation result sets, a method named CWRS is proposed. It builds on the maximum match algorithm and adds combinational ambiguity detection, overlapping (cross) ambiguity detection, and an omni-segmentation algorithm during text segmentation. The method improves the accuracy of rough segmentation and reduces the size of the rough segmentation result set, which lays a foundation for precise segmentation of Chinese text. Experiments comparing CWRS with other algorithms on the same public corpus data set show good results.

Key words: Chinese word segmentation, rough segmentation, maximum match algorithm, omni-segmentation algorithm, ambiguity detection
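The abstract names the ingredients of the method but not its exact rules, so the following is only a minimal sketch of how forward maximum matching can be paired with ambiguity checks. It is not the authors' CWRS implementation. The function names, the `max_len` limit, the toy lexicon, and the two detection rules (a lexicon word that starts inside a matched word and runs past its end counts as overlapping ambiguity; a matched word that splits into two lexicon words counts as combinational ambiguity) are assumptions based on common textbook definitions.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching; returns a list of (start, end) spans."""
    spans = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                spans.append((i, j))
                i = j
                break
    return spans


def has_overlap_ambiguity(text, span, lexicon, max_len=4):
    """True if a lexicon word starts inside the span and ends beyond it."""
    start, end = span
    for k in range(start + 1, end):
        for j in range(end + 1, min(len(text), k + max_len) + 1):
            if text[k:j] in lexicon:
                return True
    return False


def has_combination_ambiguity(text, span, lexicon):
    """True if the matched word can be split into two lexicon words."""
    start, end = span
    return any(
        text[start:k] in lexicon and text[k:end] in lexicon
        for k in range(start + 1, end)
    )


def rough_segment(text, lexicon, max_len=4):
    """Return (word, flags) pairs; flags mark spans that look ambiguous."""
    result = []
    for span in forward_max_match(text, lexicon, max_len):
        flags = []
        if has_overlap_ambiguity(text, span, lexicon, max_len):
            flags.append("overlap")
        if has_combination_ambiguity(text, span, lexicon):
            flags.append("combination")
        result.append((text[span[0]:span[1]], flags))
    return result


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {"研究", "研究生", "生命", "生", "命", "起源"}
segments = rough_segment("研究生命起源", toy_lexicon)
```

With this toy lexicon, greedy matching yields 研究生 / 命 / 起源, and only the first word is flagged, with both ambiguity types. In a design like the one the abstract describes, such flagged spans are where an omni-segmentation step could enumerate alternative cuts, which keeps the candidate set small for unambiguous text.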

LI Guohe, LIU Guangsheng, QIN Bobo, WU Weijiang, LI Hongqi. Method of Chinese word rough segmentation by maximum match and ambiguity detection algorithms[J]. Computer Engineering and Applications, 2012, 48(14): 139-142.
