Chinese comma classification based on segmentation and part-of-speech tagging (基于分词与词性标注的汉语逗号自动分类)

Computer Engineering and Applications, 2015, Vol. 51, Issue (18): 120-125.

• Database, Data Mining, Machine Learning •

GU Jingjing, ZHOU Guodong

School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu 215006, China

Online: 2015-09-15  Published: 2015-10-13

Abstract

Abstract: In recent years, punctuation, as an important part of discourse, has attracted growing attention from researchers. However, research on the Chinese comma has only just begun, and most existing methods build on syntactic parsing; no prior work classifies commas automatically using only the surface information of Chinese sentences. This paper proposes a method for automatic Chinese comma classification based on word segmentation and part-of-speech tagging information, and adopts two supervised machine learning classifiers, a maximum entropy classifier and a CRF classifier, to perform the classification. Experiments on the CTB 6.0 corpus show that the CRF achieves better overall results than maximum entropy, and that the classification accuracy of both classifiers is very close to that of the syntactic-parsing-based method. This indicates that comma classification based on words and part-of-speech tags is feasible.

Key words: Chinese comma classification, maximum entropy, Conditional Random Field (CRF)
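To make the idea of "surface information" concrete, the following is a minimal sketch of how features for a comma might be read off a segmented and POS-tagged sentence. The abstract does not list the paper's feature templates, so the window size, the feature names, the `comma_features` and `extract_all` helpers, and the toy sentence are all assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag), as produced by a segmenter and tagger

COMMA = "，"
PAD: Token = ("<PAD>", "<PAD>")  # placeholder for positions outside the sentence


def comma_features(tokens: List[Token], index: int, window: int = 2) -> Dict[str, str]:
    """Surface features for the comma at tokens[index].

    The template (words and POS tags within +/- window) is an illustrative
    assumption, not the feature set used in the paper.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    features: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the comma token itself
        pos = index + offset
        word, tag = tokens[pos] if 0 <= pos < len(tokens) else PAD
        features[f"w[{offset:+d}]"] = word
        features[f"t[{offset:+d}]"] = tag
    # POS-tag pair straddling the comma.
    features["t[-1]|t[+1]"] = features["t[-1]"] + "|" + features["t[+1]"]
    return features


def extract_all(tokens: List[Token], window: int = 2) -> List[Tuple[int, Dict[str, str]]]:
    """Return (position, features) for every comma token in the sentence."""
    return [
        (i, comma_features(tokens, i, window))
        for i, (word, _) in enumerate(tokens)
        if word == COMMA
    ]


# Toy segmented and tagged sentence (CTB-style tags, hand-written for illustration).
sentence: List[Token] = [
    ("他", "PN"), ("来", "VV"), ("了", "AS"), ("，", "PU"),
    ("我", "PN"), ("很", "AD"), ("高兴", "VA"), ("。", "PU"),
]

for position, feats in extract_all(sentence):
    print(position, feats)
```

Feature dictionaries of this kind could then be passed to a maximum entropy or CRF toolkit together with gold comma labels; because no syntactic parse is needed, the only preprocessing is segmentation and tagging.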

GU Jingjing, ZHOU Guodong. Chinese comma classification based on segmentation and part of speech tagging[J]. Computer Engineering and Applications, 2015, 51(18): 120-125.
