Computer Engineering and Applications ›› 2018, Vol. 54 ›› Issue (17): 30-34. DOI: 10.3778/j.issn.1002-8331.1806-0117

• Hot Topics and Reviews •

Domain-specific Chinese word segmentation

CHENG Yusi (成于思)¹, SHI Yuntao (施云涛)²

  1. School of Civil Engineering, Southeast University, Nanjing 210096, China
  2. Nanjing Branch Network Department, China Mobile Communications Group, Nanjing 210019, China
  • Online: 2018-09-01 Published: 2018-08-30

Abstract: In domain-specific word segmentation, statistical methods are limited by the lack of annotated domain corpora, while dictionary-based methods still handle new words and ambiguous words poorly. To address the particular demands of domain-specific segmentation, this paper proposes an approach that combines a statistical method with a domain dictionary. The approach refines the process for building a high-quality domain dictionary, uses a statistical method to obtain a preliminary segmentation, and then applies a second segmentation pass that resolves ambiguities using rules and character lists (subsets of Chinese characters with defined properties). In experiments on a construction law corpus, the approach achieves a precision of 92.08%, a recall of 94.26%, and an F-measure of 93.16%. The approach can also be combined with new word detection and similar methods to improve the handling of out-of-vocabulary words.

Key words: Chinese word segmentation, domain-specific, ambiguity resolution, domain dictionary, construction law
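
The abstract does not spell out the rules or character lists used in the second pass, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a preliminary segmentation is already available as a list of tokens, and it merges adjacent tokens whenever their concatenation appears in a domain dictionary. The function names, the `max_window` parameter, the toy sentence, and the toy dictionary are all hypothetical. The sketch also shows how word-level precision, recall, and F-measure are commonly computed, and checks that the reported F-measure is consistent with the reported precision and recall.

```python
from typing import List, Set, Tuple


def merge_with_dictionary(tokens: List[str], domain_dict: Set[str],
                          max_window: int = 4) -> List[str]:
    """Greedily merge adjacent tokens whose concatenation is a dictionary term.

    Illustrative second pass only; the longest window is tried first.
    """
    out: List[str] = []
    i = 0
    while i < len(tokens):
        merged = False
        for n in range(min(max_window, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + n])
            if candidate in domain_dict:
                out.append(candidate)
                i += n
                merged = True
                break
        if not merged:
            out.append(tokens[i])
            i += 1
    return out


def to_spans(tokens: List[str]) -> Set[Tuple[int, int]]:
    """Convert a token list to a set of (start, end) character offsets."""
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def prf(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F-measure against a gold segmentation."""
    pred_spans, gold_spans = to_spans(predicted), to_spans(gold)
    correct = len(pred_spans & gold_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical preliminary output of a general-purpose statistical segmenter.
preliminary = ["承包", "人", "应当", "提交", "竣工", "验收", "报告"]
# Hypothetical construction law dictionary entries.
domain_dict = {"承包人", "竣工验收报告"}
gold = ["承包人", "应当", "提交", "竣工验收报告"]

refined = merge_with_dictionary(preliminary, domain_dict)
print(refined)
print(prf(preliminary, gold))
print(prf(refined, gold))

# Consistency check on the figures reported in the abstract.
print(round(100 * f_measure(0.9208, 0.9426), 2))
```

A merge-only pass like this can join over-split domain terms but cannot split tokens that the first pass wrongly joined; handling that case would need additional rules such as those the abstract alludes to.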