Computer Engineering and Applications, 2010, Vol. 46, Issue (4): 113-116. DOI: 10.3778/j.issn.1002-8331.2010.04.036

• Database, Signal and Information Processing •

Semi-supervised text classification using class associated words

HAN Hong-qi1,2, ZHU Dong-hua1, LIU Song1, WANG Xue-feng1

  1. School of Management and Economics, Beijing Institute of Technology, Beijing 100081, China
  2. School of Management and Economics, North China University of Water Conservancy and Electric Power, Zhengzhou 450011, China
  • Received: 2009-02-06 Revised: 2009-03-26 Online: 2010-02-01 Published: 2010-02-01
  • Contact: HAN Hong-qi

Abstract: This paper addresses the problem of classifying unlabeled text documents when no training set is available. Class associated words are words or phrases that relate to the subject of a class and reflect it. The prior information they carry is used to form prior probabilities for document classification. A learning algorithm that combines a Naïve Bayes classifier with Expectation-Maximization (EM) iteration then adds classification constraints to the semi-supervised learning process, so that the class associated words supervise the construction of a classifier for fully unlabeled documents. The constraints restrict documents to their corresponding class labels and improve classification accuracy. Experimental results show that the method classifies text without a training set with high accuracy, and that accuracy with class associated word constraints is higher than accuracy without them.
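The abstract does not give the exact form of the prior or of the constraints, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that associated-word hit counts are normalized into an initial P(class | document), and that a document whose hits all fall in a single class stays pinned to that class during EM. The function names, the `alpha` smoothing parameter, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def seed_posteriors(docs, class_words):
    """Turn associated-word hits into an initial P(class | doc) per document."""
    classes = sorted(class_words)
    seeds = []
    for tokens in docs:
        hits = [sum(1 for t in tokens if t in class_words[c]) for c in classes]
        total = sum(hits)
        if total == 0:
            # No associated word present: start from a uniform distribution.
            seeds.append([1.0 / len(classes)] * len(classes))
        else:
            seeds.append([h / total for h in hits])
    return classes, seeds


def constrained_nb_em(docs, class_words, iterations=10, alpha=1.0):
    """Multinomial Naive Bayes trained by EM on unlabeled docs (illustrative)."""
    classes, post = seed_posteriors(docs, class_words)
    k = len(classes)
    # ASSUMED constraint: docs whose hits all belong to one class are pinned.
    pinned = [max(p) == 1.0 for p in post]
    vocab = sorted({t for tokens in docs for t in tokens})
    counts = [Counter(tokens) for tokens in docs]

    for _ in range(iterations):
        # M-step: re-estimate log P(c) and log P(w | c) with Laplace smoothing.
        log_prior = []
        log_word = []
        for j in range(k):
            mass = sum(p[j] for p in post)
            log_prior.append(math.log((mass + alpha) / (len(docs) + alpha * k)))
            weighted = Counter()
            for p, c in zip(post, counts):
                for w, n in c.items():
                    weighted[w] += p[j] * n
            denom = sum(weighted.values()) + alpha * len(vocab)
            log_word.append(
                {w: math.log((weighted[w] + alpha) / denom) for w in vocab}
            )

        # E-step: recompute P(c | doc) in log space; pinned docs keep their seed.
        new_post = []
        for i, c in enumerate(counts):
            if pinned[i]:
                new_post.append(post[i])
                continue
            scores = [
                log_prior[j] + sum(n * log_word[j][w] for w, n in c.items())
                for j in range(k)
            ]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            z = sum(exps)
            new_post.append([e / z for e in exps])
        post = new_post

    labels = [classes[max(range(k), key=lambda j: p[j])] for p in post]
    return labels, post


# Toy data (hypothetical): two docs contain an associated word, two do not.
toy_docs = [
    ["football", "team", "goal", "match"],
    ["bank", "loan", "interest", "market"],
    ["team", "goal", "match", "coach"],
    ["loan", "interest", "market", "stock"],
]
toy_class_words = {"sports": {"football"}, "finance": {"bank"}}
toy_labels, toy_post = constrained_nb_em(toy_docs, toy_class_words)
print(toy_labels)
```

In this toy run the third and fourth documents contain no associated word, yet they are assigned through the vocabulary they share with the pinned documents. Setting every entry of `pinned` to `False` would give an unconstrained variant for comparison.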

CLC Number: