Domain specific Chinese word segmentation

doi:10.3778/j.issn.1002-8331.1806-0117

Abstract

Statistical methods for Chinese word segmentation perform poorly in specialized domains because annotated domain-specific training corpora are scarce, while dictionary-based methods struggle with unknown words and segmentation ambiguities. To achieve domain adaptation, this paper proposes an approach that combines a statistical method with a domain dictionary. The approach first builds a high-quality domain dictionary and applies a statistical method to obtain a preliminary segmentation. An ambiguity resolution algorithm, based on rules and on subsets of Chinese characters with defined properties, then refines that result. Experiments on a construction law corpus yield a precision of 92.08%, a recall of 94.26%, and an F-measure of 93.16%. The approach can also be combined with new word detection to improve the handling of unknown words.
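The three reported figures are the standard word-level segmentation metrics. The sketch below shows one common way to compute them, by comparing character spans of predicted and gold words, and checks that the reported F-measure is the harmonic mean of the reported precision and recall. The function names and the toy sentence are illustrative assumptions, not taken from the paper, and the paper's own evaluation script may differ in detail.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def prf(gold, pred):
    """Word-level precision, recall and F-measure for one segmentation."""
    g, p = to_spans(gold), to_spans(pred)
    correct = len(g & p)  # words whose boundaries match exactly
    precision = correct / len(p)
    recall = correct / len(g)
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f


# Toy example (hypothetical): the prediction over-splits one domain term.
gold = ["建设工程", "施工", "合同"]
pred = ["建设", "工程", "施工", "合同"]
p, r, f = prf(gold, pred)

# Consistency check on the figures quoted in the abstract.
reported_p, reported_r = 92.08, 94.26
reported_f = 2 * reported_p * reported_r / (reported_p + reported_r)
print(round(reported_f, 2))
```

In the toy example two of the four predicted words match gold boundaries, which shows how over-splitting a domain term lowers precision more than recall.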

Key words: Chinese word segmentation, domain specific, ambiguity resolution, domain dictionary, construction law

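Abstract (translated from the Chinese original): In domain-specific word segmentation tasks, the performance of statistical segmentation methods is limited by the lack of annotated domain corpora, while dictionary-based methods still leave room for improvement in handling new words and ambiguous words. To address the particular demands of domain-specific segmentation, a method that combines statistics with a dictionary is proposed; it refines the process for building the domain dictionary and designs a second-pass ambiguity resolution method based on rules and character lists. Segmentation experiments are carried out on a construction law corpus. The results show a precision of 92.08%, a recall of 94.26%, and an F-measure of 93.16%. The method can also be combined with techniques such as new word discovery to improve the handling of out-of-vocabulary words.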
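As a rough illustration of how a domain dictionary can correct a preliminary statistical segmentation, the sketch below merges runs of adjacent tokens whose concatenation is a dictionary term. This is a simplified assumption about what a second pass might look like, not the authors' algorithm: the function name, the `max_len` parameter, and the sample dictionary are hypothetical, and the rule-based and character-list steps described in the abstract are not modeled.

```python
def merge_with_dictionary(tokens, domain_dict, max_len=4):
    """Greedily merge adjacent tokens that together form a dictionary term.

    tokens      -- preliminary segmentation, e.g. from a statistical model
    domain_dict -- set of domain terms (hypothetical sample data below)
    max_len     -- maximum number of tokens considered for one merge
    """
    out, i = [], 0
    while i < len(tokens):
        merged = None
        # Try the longest run of tokens first, down to a run of two.
        for j in range(min(len(tokens), i + max_len), i + 1, -1):
            candidate = "".join(tokens[i:j])
            if candidate in domain_dict:
                merged, i = candidate, j
                break
        if merged is None:
            # No dictionary term starts here; keep the token unchanged.
            merged, i = tokens[i], i + 1
        out.append(merged)
    return out


preliminary = ["建设", "工程", "施工", "合同"]
domain_dict = {"建设工程", "施工合同"}
refined = merge_with_dictionary(preliminary, domain_dict)
print(refined)
```

This pass only repairs over-split terms. It cannot fix a preliminary token that straddles a true word boundary, which is the kind of ambiguity a rule-based step would have to handle.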

CHENG Yusi, SHI Yuntao. Domain specific Chinese word segmentation[J]. Computer Engineering and Applications, 2018, 54(17): 30-34.

成于思，施云涛. 面向专业领域的中文分词方法[J]. 计算机工程与应用, 2018, 54(17): 30-34.

