Domain specific Chinese word segmentation

doi:10.3778/j.issn.1002-8331.1806-0117

Computer Engineering and Applications, 2018, Vol. 54, Issue (17): 30-34. DOI: 10.3778/j.issn.1002-8331.1806-0117

CHENG Yusi (成于思)1, SHI Yuntao (施云涛)2

1. School of Civil Engineering, Southeast University, Nanjing 210096, China
2. Nanjing Branch Network Department, China Mobile Communications Group, Nanjing 210019, China

Online: 2018-09-01 Published: 2018-08-30

Abstract

Abstract: In domain-specific word segmentation, statistical methods are limited by the lack of annotated in-domain training corpora, while dictionary-based methods still handle unknown words and segmentation ambiguities poorly. To adapt segmentation to a specialized domain, this paper proposes an approach that combines a statistical method with a domain dictionary. The approach refines the process for building a high-quality domain dictionary, uses a statistical method to obtain a preliminary segmentation, and then applies a second-pass ambiguity resolution algorithm based on rules and on subsets of Chinese characters with defined properties. In experiments on a construction law corpus, the approach achieves a precision of 92.08%, a recall of 94.26%, and an F-measure of 93.16%. It can also be combined with new word detection to improve the handling of unknown words.
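
The abstract does not spell out the second-pass algorithm, so the following is only a minimal sketch of one simple way a domain dictionary can refine a preliminary segmentation: adjacent tokens are merged whenever their concatenation is a dictionary entry, trying the longest run first. It is not the authors' rule-and-character-table method. The function name `merge_with_dictionary`, the `max_span` parameter, and the sample dictionary entries are all hypothetical.

```python
def merge_with_dictionary(tokens, dictionary, max_span=4):
    """Merge runs of adjacent tokens whose concatenation is a dictionary entry.

    Illustrative assumption only: `tokens` is the preliminary output of some
    statistical segmenter, `dictionary` is a set of domain terms, and
    `max_span` caps how many adjacent tokens may be merged at once.
    """
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest run of adjacent tokens first, down to a run of two.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in dictionary:
                merged.append(candidate)
                i += span
                break
        else:
            # No multi-token run matched, so keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical construction law terms, chosen only for illustration.
domain_terms = {"建设工程", "施工合同", "建设工程施工合同"}
preliminary = ["建设", "工程", "施工", "合同", "的", "效力"]
print(merge_with_dictionary(preliminary, domain_terms))
```

With the default `max_span` the four leading tokens collapse into the single term 建设工程施工合同. Lowering `max_span` to 2 yields the two shorter terms instead, which shows why the granularity of dictionary entries matters for this kind of post-processing.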

Key words: Chinese word segmentation, domain specific, ambiguity resolution, domain dictionary, construction law
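
The reported precision, recall, and F-measure can be read against the usual word-level evaluation for segmentation, sketched below under the assumption that a predicted word counts as correct only when both of its boundaries match the gold standard. The abstract does not describe the evaluation script, so the helper names `to_spans`, `evaluate`, and `f_measure` are hypothetical, and the sample sentences are invented for illustration.

```python
def to_spans(words):
    """Convert a word sequence into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(gold_words, predicted_words):
    """Return (precision, recall, F) for one segmented sentence."""
    gold = to_spans(gold_words)
    predicted = to_spans(predicted_words)
    correct = len(gold & predicted)  # words with both boundaries right
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Invented example: the prediction wrongly splits the first gold word.
gold = ["建设工程", "施工合同", "的", "效力"]
predicted = ["建设", "工程", "施工合同", "的", "效力"]
print(evaluate(gold, predicted))

# Consistency check on the figures reported in the abstract.
print(round(f_measure(92.08, 94.26), 2))  # 93.16
```

The last line confirms that the stated F-measure of 93.16% is the harmonic mean of the stated precision and recall.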

CHENG Yusi, SHI Yuntao. Domain specific Chinese word segmentation[J]. Computer Engineering and Applications, 2018, 54(17): 30-34.

