Chinese comma classification based on segmentation and part of speech tagging

Abstract

In recent years, punctuation, as an important part of discourse, has attracted growing attention from researchers. Research on the Chinese comma, however, has only just begun, and most existing methods rely on syntactic analysis; no prior work classifies Chinese commas automatically from the surface information of sentences alone. This paper proposes a method for Chinese comma classification based on word segmentation and part-of-speech tagging, and uses two supervised machine learning classifiers, a maximum entropy classifier and a conditional random field (CRF) classifier, to classify commas automatically. Experimental results on the CTB 6.0 corpus show that the CRF model outperforms the maximum entropy model overall, and that the accuracy of both classifiers is very close to that of the method based on syntactic analysis. This demonstrates that Chinese comma classification based on words and part-of-speech tags is feasible.

Keywords: Chinese comma classification, maximum entropy, conditional random field (CRF)
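As a rough illustration of what "surface information" can mean here, the sketch below extracts word and part-of-speech features from a window around each comma in a segmented, tagged sentence. The window size, the feature names, and the helper functions `comma_features` and `extract_comma_instances` are assumptions made for this example; the abstract does not specify the paper's actual feature templates. The POS tags follow the CTB convention (PN, VV, AS, PU).

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag) from segmentation + POS tagging

COMMA = "，"  # full-width Chinese comma


def comma_features(tokens: List[Token], i: int, window: int = 2) -> Dict[str, str]:
    """Build surface features for the comma at index i.

    Only words and POS tags are used; no syntactic parse is required.
    The feature templates are hypothetical, not taken from the paper.
    """
    feats: Dict[str, str] = {}
    for off in range(-window, window + 1):
        if off == 0:
            continue  # skip the comma itself
        j = i + off
        if 0 <= j < len(tokens):
            word, pos = tokens[j]
        else:
            word, pos = "<PAD>", "<PAD>"  # position falls outside the sentence
        feats[f"w[{off:+d}]"] = word
        feats[f"pos[{off:+d}]"] = pos
    # Conjunction of the POS tags immediately around the comma.
    feats["pos[-1]|pos[+1]"] = feats["pos[-1]"] + "|" + feats["pos[+1]"]
    return feats


def extract_comma_instances(tokens: List[Token]) -> List[Tuple[int, Dict[str, str]]]:
    """Return (index, features) for every comma in a tagged sentence."""
    return [
        (i, comma_features(tokens, i))
        for i, (word, _pos) in enumerate(tokens)
        if word == COMMA
    ]


# Example: "他 来 了 ， 我 走 了 。" with CTB-style tags.
tokens = [
    ("他", "PN"), ("来", "VV"), ("了", "AS"), ("，", "PU"),
    ("我", "PN"), ("走", "VV"), ("了", "AS"), ("。", "PU"),
]
instances = extract_comma_instances(tokens)
```

Feature dictionaries of this kind could then be passed, together with gold comma labels, to a maximum entropy trainer (one instance per comma) or to a CRF trainer (one sequence per sentence). Which toolkit and label set to use is left open here.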

GU Jingjing, ZHOU Guodong. Chinese comma classification based on segmentation and part of speech tagging[J]. Computer Engineering and Applications, 2015, 51(18): 120-125.

谷晶晶，周国栋. 基于分词与词性标注的汉语逗号自动分类[J]. 计算机工程与应用, 2015, 51(18): 120-125.

