Computer Engineering and Applications, 2013, Vol. 49, Issue (11): 126-129.

Kazakh part-of-speech tagging method based on maximum entropy

SANG Haiyan1,2, Gulia·Altenbek1,2, NIU Ningning1,2   

  1. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
  2. The Base of Kazakh and Kirghiz Language, Minority Languages Branch, National Language Resource Monitoring and Research Center, Urumqi 830046, China
  • Online:2013-06-01 Published:2013-06-14

Abstract: A maximum entropy model can make full use of context and flexibly combine multiple features. This paper applies a maximum entropy model to part-of-speech tagging of Kazakh. Feature templates are designed around the agglutinative nature and rich morphology of Kazakh, and a backward-dependent part-of-speech feature template is added. The model is also improved at decoding time: the n most probable part-of-speech tags of each word are each added to the feature vector of the next word, and so on until the end of the sentence, after which the part-of-speech sequence with the highest overall probability is selected. Experimental results show that the chosen feature templates are appropriate and that the improved model reaches an accuracy of 96.8%.

Key words: natural language processing, part-of-speech tagging, maximum entropy model, Kazakh
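
The decoding step described in the abstract can be read as a beam-search-style procedure: keep the n best tag hypotheses at each word, feed each one into the features of the next word, and pick the most probable full sequence at the end. The following is a minimal sketch of that reading, not the authors' implementation. The names `beam_decode`, `TagScorer`, and `toy_score` are hypothetical, and the toy probabilities and tag set are invented for illustration; a real system would use a trained maximum entropy classifier as the scorer.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical scorer interface (an assumption, not from the paper):
# given the sentence, a position, and the previous tag, return P(tag | context).
TagScorer = Callable[[List[str], int, str], Dict[str, float]]


def beam_decode(words: List[str], score: TagScorer, n: int = 3) -> List[str]:
    """Tag a sentence left to right, keeping the n best partial sequences."""
    # Each beam entry is (log-probability so far, tags chosen so far).
    beam: List[Tuple[float, List[str]]] = [(0.0, [])]
    for i in range(len(words)):
        candidates: List[Tuple[float, List[str]]] = []
        for logp, tags in beam:
            prev = tags[-1] if tags else "<BOS>"
            # The previous tag hypothesis enters the next word's features.
            for tag, p in score(words, i, prev).items():
                if p > 0.0:
                    candidates.append((logp + math.log(p), tags + [tag]))
        # Keep only the n most probable partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:n]
    # The first entry is the most probable complete sequence.
    return beam[0][1] if beam else []


def toy_score(words: List[str], i: int, prev_tag: str) -> Dict[str, float]:
    # Invented probabilities for illustration only; not results from the paper.
    if prev_tag == "<BOS>":
        return {"N": 0.6, "V": 0.4}
    if prev_tag == "N":
        return {"N": 0.3, "V": 0.7}
    return {"N": 0.8, "V": 0.2}


if __name__ == "__main__":
    print(beam_decode(["w1", "w2", "w3"], toy_score, n=2))
```

With n = 1 this reduces to greedy tagging; a larger n lets a locally less probable tag survive if it leads to a better overall sequence, which is the motivation the abstract gives for the improvement.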
