Method of Chinese words rough segmentation based on improving maximum match algorithm

Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (2): 124-128.

• Databases, Data Mining, and Machine Learning •

ZHOU Jun1,3, ZHENG Zhonghua2, ZHANG Wei3

1. State Key Lab of Mold Technology, Huazhong University of Science and Technology, Wuhan 430074, China
2. School of Education, Renmin University of China, Beijing 100872, China
3. Anhui Boryou Information Technology Co., Ltd., Hefei 230088, China

Online: 2014-01-15 Published: 2014-01-26

Abstract

Abstract: Rough segmentation and ambiguity resolution are the two fundamental stages of Chinese word segmentation. By introducing generalized terms and an induced word set, this paper proposes a rough segmentation method built on the maximum matching algorithm: text is segmented by longest generalized-term matching, and the induced word set is used to identify overlapping ambiguities. The method segments unambiguous Chinese sentences quickly and accurately while detecting and marking 100% of the overlapping ambiguities in ambiguous sentences, which simplifies the subsequent ambiguity resolution stage as far as possible. Tests on the January 1998 People's Daily corpus, which contains 1.6 million Chinese characters, demonstrate the method's effectiveness in terms of speed, ambiguous-word accuracy, and rough segmentation recall.

Key words: Chinese word segmentation, maximum matching, generalized term, induced word set
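For orientation, the following is a minimal sketch of the plain forward maximum matching baseline that the abstract builds on, together with one simple way to flag overlapping ambiguities. It is not the authors' method: the abstract does not define generalized terms or the induced word set, so neither is modeled here. The function names, the toy lexicon, and the `max_len` parameter are all illustrative assumptions.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching (baseline, not the paper's variant).

    Returns (start_index, token) pairs. Characters that start no lexicon
    word are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append((i, candidate))
                i += size
                break
    return tokens


def overlapping_ambiguities(text, tokens, lexicon, max_len=4):
    """Flag spans where a lexicon word straddles the end of a matched token.

    Assumption: an overlapping ambiguity is reported as (start, end) when
    some lexicon word begins inside a token and ends beyond it.
    """
    spans = []
    for start, word in tokens:
        end = start + len(word)
        for j in range(start + 1, end):
            for k in range(end + 1, min(j + max_len, len(text)) + 1):
                if text[j:k] in lexicon:
                    spans.append((start, k))
    return spans


# Hypothetical toy lexicon, chosen only to trigger one overlap.
toy_lexicon = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"
matched = forward_max_match(sentence, toy_lexicon)
flagged = overlapping_ambiguities(sentence, matched, toy_lexicon)
```

On this toy input the greedy pass yields 研究生 / 命 / 起源, and the span covering 研究生命 is flagged because 生命 crosses the boundary of 研究生. A rough segmenter in the spirit of the abstract would pass such flagged spans to a later ambiguity resolution stage instead of committing to the greedy split.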

ZHOU Jun, ZHENG Zhonghua, ZHANG Wei. Method of Chinese words rough segmentation based on improving maximum match algorithm[J]. Computer Engineering and Applications, 2014, 50(2): 124-128.
