Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (5): 110-112.

• Database, Signal and Information Processing •

Machine learning based Uyghur language text categorization

Alimjan AYSA1,2, Turgun IBRAHIM2, Hasan OMAR2, Marhaba ALI2

  1. Modern Education Technology Center, Xinjiang University, Urumqi 830046, China
  2. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2012-02-11 Published: 2012-02-11

Abstract: With the rapid growth of Uyghur-language text on the Internet, Uyghur text categorization has become a key technique for processing and organizing this large volume of text data. This paper studies techniques and methods for Uyghur text categorization. To address the high dimensionality of Uyghur texts under the vector space model (VSM) representation, stemming is combined with information gain (IG) to reduce the dimensionality of the representation space. Categorization experiments are performed on a Uyghur text corpus using two machine learning based algorithms, k-nearest neighbor (kNN) and Naïve Bayes, and the experimental results are analyzed.

Key words: text categorization, Naïve Bayes, k-Nearest Neighbor (kNN), Uyghur language, feature selection
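
The pipeline outlined in the abstract (stemmed tokens, IG-based feature selection, then kNN and Naïve Bayes classification) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy English tokens stand in for already-stemmed Uyghur words, the stemmer itself is omitted, IG is computed on term presence, kNN uses cosine similarity over term counts, and Naïve Bayes is the multinomial variant with add-one smoothing. All function names and the sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (bits) of a class distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(doc_sets, labels, term):
    """IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t), using term presence."""
    with_t = Counter(l for d, l in zip(doc_sets, labels) if term in d)
    without_t = Counter(l for d, l in zip(doc_sets, labels) if term not in d)
    ig = entropy(Counter(labels).values())
    for part in (with_t, without_t):
        size = sum(part.values())
        if size:
            ig -= size / len(labels) * entropy(part.values())
    return ig


def select_features(docs, labels, k):
    """Keep the k terms with the highest IG (ties broken alphabetically)."""
    doc_sets = [set(d) for d in docs]
    vocab = sorted({t for d in doc_sets for t in d})
    ranked = sorted(vocab, key=lambda t: -information_gain(doc_sets, labels, t))
    return ranked[:k]


def vectorize(doc, features):
    """Term-count vector restricted to the selected features (sparse VSM)."""
    keep = set(features)
    return Counter(t for t in doc if t in keep)


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_predict(train_vecs, train_labels, vec, k=3):
    """Majority vote among the k training vectors most similar to vec."""
    nearest = sorted(zip(train_vecs, train_labels),
                     key=lambda pair: -cosine(pair[0], vec))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def nb_train(train_vecs, train_labels, features):
    """Collect class priors and per-class term counts."""
    prior = Counter(train_labels)
    term_counts = defaultdict(Counter)
    for vec, label in zip(train_vecs, train_labels):
        term_counts[label].update(vec)
    return prior, term_counts, list(features)


def nb_predict(model, vec):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    prior, term_counts, features = model
    n_docs = sum(prior.values())

    def log_posterior(label):
        total = sum(term_counts[label].values())
        score = math.log(prior[label] / n_docs)
        for term, count in vec.items():
            score += count * math.log(
                (term_counts[label][term] + 1) / (total + len(features)))
        return score

    return max(prior, key=log_posterior)


# Hypothetical toy corpus: each document is a list of already-stemmed tokens.
docs = [
    ["team", "match", "goal", "news"],
    ["match", "goal", "player", "news"],
    ["team", "player", "goal"],
    ["market", "price", "bank", "news"],
    ["bank", "price", "stock", "news"],
    ["market", "stock", "price"],
]
labels = ["sport"] * 3 + ["economy"] * 3

features = select_features(docs, labels, k=4)
train_vecs = [vectorize(d, features) for d in docs]
nb_model = nb_train(train_vecs, labels, features)

query = vectorize(["goal", "match", "news"], features)
print(features)
print(knn_predict(train_vecs, labels, query), nb_predict(nb_model, query))
```

In this toy corpus a term such as "news", which occurs equally often in both classes, receives an IG of zero and is dropped, which is the kind of dimensionality reduction the abstract describes.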