Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (15): 120-124.

• Database, Data Mining, Machine Learning •

Research on Kirgiz language part of speech tagging based on HMM

CHEN Li, Gulila·ALTENBEK   

  1. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
  • Online:2014-08-01 Published:2014-08-04

Abstract: Research on Kirghiz language information processing plays a vital role in whether the Kirghiz people of Xinjiang can enter the information age and pass on their national culture. Building on traditional HMM theory, this paper adopts a two-level tagging method and improves the computation of the HMM parameters, the data smoothing, and the handling of unknown (out-of-vocabulary) words, so that contextual dependencies are captured better. In addition, a stemming algorithm based on an automatic word segmentation dictionary is combined with a hybrid rule-based and statistical method and applied to a Kirghiz part-of-speech tagging system. Compared with the traditional HMM, the improved method effectively raises tagging accuracy.

Key words: Kirghiz, automatic word segmentation dictionary, Hidden Markov Model (HMM), part-of-speech tagging
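
For orientation, the traditional HMM baseline that the abstract compares against can be sketched roughly as follows. This is a minimal illustration, not the authors' improved method: it uses a plain bigram HMM with add-one smoothing (a stand-in, since the abstract does not specify the smoothing scheme), reserves one smoothed outcome for unknown words, and decodes with the Viterbi algorithm. The function names, the tag set, and the tiny English toy corpus are all hypothetical and chosen only to keep the example self-contained.

```python
import math
from collections import defaultdict

def train(tagged_sentences):
    """Count tag transitions and word emissions from (word, tag) sentences."""
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(int))   # emit[tag][word]
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab

def log_prob(counts, context, event, n_outcomes):
    """Add-one smoothed log P(event | context); unseen events get count 0."""
    row = counts.get(context, {})
    return math.log((row.get(event, 0) + 1) / (sum(row.values()) + n_outcomes))

def viterbi(words, trans, emit, tags, vocab):
    """Return the most probable tag sequence for words under the bigram HMM."""
    if not words:
        return []
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for unknown words
    # best[i][t] = (log score of best path ending in tag t at position i, previous tag)
    best = [{t: (log_prob(trans, "<s>", t, n_tags)
                 + log_prob(emit, t, words[0], n_words), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev_tag, score = max(
                ((p, best[i - 1][p][0] + log_prob(trans, p, t, n_tags)) for p in tags),
                key=lambda pair: pair[1])
            column[t] = (score + log_prob(emit, t, words[i], n_words), prev_tag)
        best.append(column)
    # Backtrace from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Hypothetical toy corpus, for illustration only.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train(corpus)
print(viterbi(["the", "cat", "barks"], *model))  # ['DET', 'NOUN', 'VERB']
```

In this sketch an unknown word receives the same smoothed emission probability under every tag, so its tag is decided entirely by the neighbouring transitions. That is the kind of weakness a dedicated unknown-word treatment, such as the stem- and dictionary-based handling the abstract mentions, would be expected to address.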