Segmentation of Chinese word based on method of rough segment and part of speech tagging

Computer Engineering and Applications, 2015, Vol. 51, Issue (6): 204-207.

JIANG Fang (1,2), LI Guohe (1,2,3), YUE Xiang (4), WU Weijiang (1,2,3), HONG Yunfeng (3), LIU Zhiyuan (3), CHENG Yuan (3)

1. College of Geophysics and Information Engineering, China University of Petroleum, Beijing 102249, China
2. Beijing Key Lab of Data Mining for Petroleum Data, China University of Petroleum, Beijing 102249, China
3. PanPass Institute of Digital Identification Management and Internet of Things, Beijing 100029, China
4. Information & Data Center, CNOOC Research Institute, Beijing 100029, China

Online: 2015-03-15 Published: 2015-03-13

Abstract

Abstract: Chinese word segmentation is an important part of Chinese information processing. In the proposed method, a rough segmentation step based on maximum matching and ambiguity detection first produces a set of candidate (rough) segmentations of the text. Each word in every candidate is then tagged with its part of speech by the Viterbi algorithm under a Hidden Markov Model (HMM) of part-of-speech tagging. An evaluation function defined over the part-of-speech tagging scores each candidate, and the highest-scoring candidate is selected as the final segmentation. Comparative experiments show that the proposed method segments better than the other methods it was compared against.

Key words: word segmentation, part-of-speech tagging, Hidden Markov Model (HMM), Viterbi algorithm
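
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the lexicon, the tag set (n, v, u) and all probabilities are invented toy values; ambiguity detection is approximated by comparing forward and backward maximum matching; and the evaluation function is assumed to be the Viterbi log-probability of the best tag sequence, since the abstract does not give the paper's actual definition. All function names are hypothetical.

```python
import math

# Toy resources (assumptions for illustration only, not from the paper).
MAX_WORD_LEN = 3
LEXICON = {"研究", "研究生", "生命", "命", "的", "起源"}
TAGS = ["n", "v", "u"]  # noun, verb, auxiliary
START = {"n": 0.4, "v": 0.4, "u": 0.2}
TRANS = {
    "n": {"n": 0.3, "v": 0.3, "u": 0.4},
    "v": {"n": 0.6, "v": 0.1, "u": 0.3},
    "u": {"n": 0.7, "v": 0.2, "u": 0.1},
}
EMIT = {
    "n": {"研究": 0.1, "研究生": 0.1, "生命": 0.3, "命": 0.05, "起源": 0.45},
    "v": {"研究": 1.0},
    "u": {"的": 1.0},
}
FLOOR = 1e-8  # probability assigned to (tag, word) pairs missing from EMIT


def forward_max_match(text):
    """Greedy longest-match segmentation scanning left to right."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            # Fall back to a single character when nothing matches.
            if text[i:i + size] in LEXICON or size == 1:
                words.append(text[i:i + size])
                i += size
                break
    return words


def backward_max_match(text):
    """Greedy longest-match segmentation scanning right to left."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(MAX_WORD_LEN, j), 0, -1):
            if text[j - size:j] in LEXICON or size == 1:
                words.append(text[j - size:j])
                j -= size
                break
    return words[::-1]


def viterbi(words):
    """Return (log-probability, tags) of the best tag path for non-empty words."""
    def emit(tag, word):
        return math.log(EMIT[tag].get(word, FLOOR))

    scores = {t: math.log(START[t]) + emit(t, words[0]) for t in TAGS}
    backpointers = []
    for word in words[1:]:
        new_scores, pointers = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: scores[p] + math.log(TRANS[p][t]))
            new_scores[t] = scores[prev] + math.log(TRANS[prev][t]) + emit(t, word)
            pointers[t] = prev
        scores = new_scores
        backpointers.append(pointers)
    last = max(TAGS, key=scores.get)
    best_score = scores[last]
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return best_score, path[::-1]


def segment(text):
    """Generate rough segmentations, tag each one, and keep the best-scoring one."""
    candidates = [forward_max_match(text)]
    backward = backward_max_match(text)
    if backward != candidates[0]:  # the two scans disagree: ambiguity detected
        candidates.append(backward)
    scored = [(viterbi(c), c) for c in candidates]
    (score, tags), words = max(scored, key=lambda item: item[0][0])
    return words, tags
```

Calling segment("研究生命的起源") compares the forward candidate 研究生/命/的/起源 with the backward candidate 研究/生命/的/起源 and returns the one whose tag sequence scores higher under the toy model. A raw path log-probability tends to favour candidates with fewer words, so a real evaluation function would likely need some form of length normalisation; that choice is left open here.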

JIANG Fang, LI Guohe, YUE Xiang, WU Weijiang, HONG Yunfeng, LIU Zhiyuan, CHENG Yuan. Segmentation of Chinese word based on method of rough segment and part of speech tagging[J]. Computer Engineering and Applications, 2015, 51(6): 204-207.
