Computer Engineering and Applications ›› 2015, Vol. 51 ›› Issue (18): 120-125.

• Database, Data Mining, Machine Learning •

基于分词与词性标注的汉语逗号自动分类

谷晶晶,周国栋   

  1. 苏州大学 计算机科学与技术学院,江苏 苏州 215006
  • 出版日期:2015-09-15 发布日期:2015-10-13

Chinese comma classification based on segmentation and part of speech tagging

GU Jingjing, ZHOU Guodong   

  1. School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Online: 2015-09-15 Published: 2015-10-13

摘要: 近年来,标点符号作为篇章的重要部分逐渐引起研究者的关注。然而,针对汉语逗号的研究才刚刚展开,采用的方法也大多都是在句法分析的基础上,尚不存在利用汉语句子的表层信息开展逗号自动分类的研究。提出了一种基于汉语句子的分词与词性标注信息做逗号自动分类的方法,并采用了两种有监督的机器学习分类器,即最大熵分类器和CRF分类器,来完成逗号的自动分类。在CTB 6.0语料上的实验表明,CRF的总体结果比最大熵的要好,而这两种分类器的分类精度都非常接近基于句法分析方法的分类精度。由此说明,基于词与词性做逗号分类的方法是可行的。

关键词: 汉语逗号分类, 最大熵, 条件随机场(CRF)

Abstract: In recent years, punctuation, as an important part of discourse, has attracted growing attention from researchers. However, research on Chinese commas has only just begun, and most existing methods build on syntactic analysis; no prior work classifies commas automatically using only the surface information of Chinese sentences. This paper proposes a method for automatic Chinese comma classification based on word segmentation and part-of-speech tagging information, and adopts two supervised machine learning classifiers, a maximum entropy classifier and a CRF classifier, to perform the classification. Experimental results on the CTB 6.0 corpus show that the CRF model performs better overall than the maximum entropy model, and that the accuracy of both classifiers is very close to that of the method based on syntactic analysis. This demonstrates that Chinese comma classification based on words and part-of-speech tags is feasible.

Key words: Chinese comma classification, maximum entropy, conditional random field (CRF)
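
The abstract describes classifying each comma from surface information only, namely the segmented words and their part-of-speech tags. The abstract does not list the actual feature templates, so the following is only a minimal sketch of what such surface features might look like. The window size, the feature names, the padding symbol, and the function names `comma_features` and `extract_comma_instances` are all assumptions made for illustration, not the authors' design.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag), e.g. ("他", "PN")

# Assumption: a comma may appear in full-width or ASCII form in the input.
COMMAS = {",", ","}
PAD = "<PAD>"  # hypothetical padding symbol for positions outside the sentence


def comma_features(sent: List[Token], i: int, window: int = 2) -> Dict[str, str]:
    """Build word/POS context features for the comma at index i.

    Only surface information is used: the words and tags within
    `window` positions on either side of the comma.
    """
    def at(j: int) -> Token:
        # Return the token at position j, or padding if j is out of range.
        return sent[j] if 0 <= j < len(sent) else (PAD, PAD)

    feats: Dict[str, str] = {}
    for off in range(-window, window + 1):
        if off == 0:
            continue  # skip the comma itself
        word, tag = at(i + off)
        feats[f"w[{off:+d}]"] = word
        feats[f"t[{off:+d}]"] = tag
    # Conjunction of the tags immediately left and right of the comma.
    feats["t[-1]|t[+1]"] = at(i - 1)[1] + "|" + at(i + 1)[1]
    return feats


def extract_comma_instances(sent: List[Token], window: int = 2):
    """Return (index, features) for every comma token in the sentence."""
    return [
        (i, comma_features(sent, i, window))
        for i, (word, _tag) in enumerate(sent)
        if word in COMMAS
    ]
```

Feature dictionaries of this shape could then be passed to a maximum entropy classifier (one instance per comma) or a CRF (one sequence per sentence); which toolkit the authors used is not stated in the abstract.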