Computer Engineering and Applications ›› 2015, Vol. 51 ›› Issue (6): 204-207.

• Signal Processing •

基于粗分和词性标注的中文分词方法

姜  芳1,2,李国和1,2,3,岳  翔4,吴卫江1,2,3,洪云峰3,刘智渊3,程  远3   

  1. 中国石油大学(北京) 地球物理与信息工程学院,北京 102249
  2. 中国石油大学(北京) 油气数据挖掘北京市重点实验室,北京 102249
  3. 石大兆信数字身份管理与物联网技术研究院,北京 100029
  4. 中海油研究总院 信息数据中心,北京 100029
  • 出版日期:2015-03-15 发布日期:2015-03-13

Segmentation of Chinese word based on method of rough segment and part of speech tagging

JIANG Fang1,2, LI Guohe1,2,3, YUE Xiang4, WU Weijiang1,2,3, HONG Yunfeng3, LIU Zhiyuan3, CHENG Yuan3   

  1. College of Geophysics and Information Engineering, China University of Petroleum, Beijing 102249, China
  2. Beijing Key Lab of Data Mining for Petroleum Data, China University of Petroleum, Beijing 102249, China
  3. PanPass Institute of Digital Identification Management and Internet of Things, Beijing 100029, China
  4. Information & Data Center, CNOOC Research Institute, Beijing 100029, China
  • Online:2015-03-15 Published:2015-03-13

摘要: 中文分词是中文信息处理的重要内容之一。在基于最大匹配和歧义检测的粗分方法获取中文粗分结果集上,根据隐马尔可夫模型标注词性,通过Viterbi算法对每个中文分词的粗分进行词性标注。通过定义最优分词粗分的评估函数对每个粗分的词性标注进行粗分评估,获取最优的粗分为最终分词。通过实验对比,证明基于粗分和词性标注的中文分词方法具有良好的分词效果。

关键词: 分词, 词性标注, 隐马尔可夫模型, Viterbi算法

Abstract: Chinese word segmentation is one of the important tasks in Chinese information processing. A rough segmentation method based on maximum matching and ambiguity detection first produces a set of candidate (rough) segmentations of the text. Each candidate is then tagged with parts of speech by the Viterbi algorithm under a hidden Markov model (HMM). An evaluation function defined for selecting the optimal rough segmentation scores each tagged candidate, and the best-scoring candidate is taken as the final segmentation. Comparative experiments show that the proposed method, based on rough segmentation and part-of-speech tagging, achieves good segmentation results.

Key words: word segmentation, part-of-speech tagging, hidden Markov model (HMM), Viterbi algorithm
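
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the ambiguity detection procedure or the evaluation function, so this sketch assumes that forward and backward maximum matching supply the rough segmentations (they disagree where an overlap ambiguity occurs) and uses the Viterbi log-probability of the best tag sequence as a stand-in score. The lexicon, the tag set (`n`, `v`, `u`), all probabilities, and every function name are hypothetical toy values chosen for illustration.

```python
import math

# Toy lexicon and HMM parameters: illustrative assumptions, not values from the paper.
LEXICON = {"研究", "研究生", "生命", "命", "的", "起源"}
MAX_WORD_LEN = 3
FLOOR = 1e-6  # emission probability assumed for unseen (tag, word) pairs

START = {"n": 0.5, "v": 0.4, "u": 0.1}
TRANS = {
    "n": {"n": 0.3, "v": 0.3, "u": 0.4},
    "v": {"n": 0.6, "v": 0.1, "u": 0.3},
    "u": {"n": 0.8, "v": 0.15, "u": 0.05},
}
EMIT = {
    "n": {"研究生": 0.2, "生命": 0.3, "起源": 0.3, "研究": 0.1, "命": 0.01},
    "v": {"研究": 0.5},
    "u": {"的": 0.9},
}


def forward_max_match(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Greedy longest match scanning left to right; single characters always match."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Greedy longest match scanning right to left."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def rough_segmentations(text):
    """Collect distinct candidates; the two scans differ where they hit an ambiguity."""
    candidates = []
    for seg in (forward_max_match(text), backward_max_match(text)):
        if seg not in candidates:
            candidates.append(seg)
    return candidates


def viterbi(words):
    """Return the most probable tag sequence and its log-probability under the toy HMM."""
    tags = list(START)
    emit = lambda t, w: math.log(EMIT[t].get(w, FLOOR))
    score = {t: math.log(START[t]) + emit(t, words[0]) for t in tags}
    back = []
    for w in words[1:]:
        prev, score, pointers = score, {}, {}
        for t in tags:
            best = max(tags, key=lambda p: prev[p] + math.log(TRANS[p][t]))
            score[t] = prev[best] + math.log(TRANS[best][t]) + emit(t, w)
            pointers[t] = best
        back.append(pointers)
    last = max(tags, key=score.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1], score[last]


def best_segmentation(text):
    """Tag every rough segmentation and keep the one with the highest stand-in score."""
    scored = [(seg, *viterbi(seg)) for seg in rough_segmentations(text)]
    return max(scored, key=lambda item: item[2])


if __name__ == "__main__":
    words, tags, log_prob = best_segmentation("研究生命的起源")
    print(list(zip(words, tags)), round(log_prob, 3))
```

For the input 研究生命的起源, forward matching yields 研究生 / 命 / 的 / 起源 and backward matching yields 研究 / 生命 / 的 / 起源; under these toy parameters the second candidate scores higher. A raw log-probability favors segmentations with fewer words, so a practical evaluation function would likely need to account for candidate length.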