Computer Engineering and Applications ›› 2018, Vol. 54 ›› Issue (17): 30-34.DOI: 10.3778/j.issn.1002-8331.1806-0117

Domain-specific Chinese word segmentation

CHENG Yusi1, SHI Yuntao2   

  1. School of Civil Engineering, Southeast University, Nanjing 210096, China
  2. Nanjing Branch Network Department, China Mobile Communications Group, Nanjing 210019, China
  • Online: 2018-09-01 Published: 2018-08-30

面向专业领域的中文分词方法

成于思1,施云涛2   

  1. 东南大学 土木工程学院,南京 210096
  2. 中国移动通信集团 南京分公司网络部,南京 210019

Abstract: Statistical methods for Chinese word segmentation perform poorly in specialized domains because annotated domain-specific training corpora are scarce, while dictionary-based methods struggle with unknown words and segmentation ambiguities. To achieve domain adaptation, an approach that combines a statistical method with a domain dictionary is proposed. The approach first builds a high-quality domain dictionary and uses a statistical method to obtain a preliminary segmentation. A second-pass disambiguation algorithm, based on rules and on subsets of Chinese characters with defined properties, then resolves the remaining ambiguities. Experiments on a construction law corpus show a precision of 92.08%, a recall of 94.26%, and an F-measure of 93.16%. The approach can also be combined with new word detection to improve the handling of unknown words.
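The two-stage idea in the abstract, and the way its scores relate to one another, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the rule-based and character-table disambiguation step is replaced here by a much simpler greedy merge against a domain dictionary, and the function names, the `max_len` parameter, the example sentence and the dictionary entry are all hypothetical. The evaluation part uses the standard word-level definitions of precision, recall and F-measure.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def prf(gold_words, pred_words):
    """Word-level precision, recall and F-measure for one sentence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words whose boundaries match exactly
    precision = correct / len(pred)
    recall = correct / len(gold)
    return precision, recall, f_measure(precision, recall)


def merge_with_dictionary(words, domain_dict, max_len=8):
    """Second pass (simplified assumption): greedily merge runs of adjacent
    preliminary words whose concatenation is a domain dictionary term."""
    out, i = [], 0
    while i < len(words):
        # Try the longest run of adjacent words first.
        for j in range(len(words), i + 1, -1):
            candidate = "".join(words[i:j])
            if len(candidate) <= max_len and candidate in domain_dict:
                out.append(candidate)
                i = j
                break
        else:
            # No dictionary term starts here; keep the preliminary word.
            out.append(words[i])
            i += 1
    return out


# Hypothetical example: output of a general-purpose statistical segmenter.
preliminary = ["建设", "工程", "合同", "纠纷"]
domain_dict = {"建设工程合同"}  # illustrative domain term
gold = ["建设工程合同", "纠纷"]

merged = merge_with_dictionary(preliminary, domain_dict)
print(merged, prf(gold, preliminary), prf(gold, merged))

# The reported F-measure is consistent with the reported precision and recall.
print(round(f_measure(92.08, 94.26), 2))
```

The last line confirms that 93.16% is the harmonic mean of 92.08% and 94.26%. A merge-only second pass cannot split a wrongly joined word, which is one reason a real system needs a disambiguation step such as the one the abstract describes.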

Key words: Chinese word segmentation, domain-specific, ambiguity resolution, domain dictionary, construction law

摘要: 在专业领域分词任务中,基于统计的分词方法的性能受限于缺少专业领域的标注语料,而基于词典的分词方法在处理新词和歧义词方面还有待提高。针对专业领域分词的特殊性,提出统计与词典相结合的分词方法,完善领域词典构建流程,设计基于规则和字表的二次分词歧义消解方法。在工程法领域语料上进行分词实验。实验结果表明,在工程法领域的分词结果准确率为92.08%,召回率为94.26%,F值为93.16%。该方法还可与新词发现等方法结合,改善未登录词的处理效果。

关键词: 中文分词, 专业领域, 歧义消解, 领域词典, 工程法