Computer Engineering and Applications, 2008, Vol. 44, Issue (13): 24-26.

• Doctoral Forum •

Method of feature selection for text categorization with Bayesian classifiers

CHEN Jing-nian (1,2), HUANG Hou-kuan (1), TIAN Feng-zhan (1), QU You-li (1)

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  2. Department of Information and Computing Science, Shandong University of Finance, Ji’nan 250014, China
  • Received: 2007-12-12  Revised: 2008-01-21  Online: 2008-05-01  Published: 2008-05-01
  • Contact: CHEN Jing-nian

A text feature selection method for Bayesian classifiers

陈景年1,2,黄厚宽1,田凤占1,瞿有利1   

  1. 北京交通大学 计算机与信息技术学院,北京 100044
  2. 山东财政学院 信息与计算科学系,济南 250014
  • Corresponding author: 陈景年

Abstract: Feature selection is an important preprocessing technique in text classification. It can improve both the efficiency and the accuracy of a text classifier. The key to feature selection in text classification is finding an effective feature evaluation metric. In general, the same feature evaluation metric can perform very differently with different classifiers, so a good metric should take the characteristics of the classifier into account. Because the Naïve Bayesian classifier is simple, efficient, and highly sensitive to feature selection, research on feature selection designed specifically for it is important. This paper presents a feature evaluation metric for the Naïve Bayesian classifier applied to multi-class text datasets: the Class Discriminating Measure (CDM). Text classification experiments with Naïve Bayesian classifiers were carried out on two multi-class text collections. The results indicate that CDM selects features more effectively than the other feature selection approaches compared.

Key words: text classification, feature selection, text preprocessing, Naïve Bayes

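Abstract (translated from the Chinese): Feature selection is an important text preprocessing technique in text classification that can effectively improve the accuracy and efficiency of a classifier. The key to feature selection in text classification is finding an effective feature evaluation metric. In general, the same feature evaluation metric performs differently with different classifiers, so a good metric should take the characteristics of the classifier into account. Because the Naïve Bayesian classifier is simple, efficient, and very sensitive to feature selection, research on feature selection methods for this kind of classifier is of considerable significance. In view of this, an effective multi-class text feature evaluation metric for Bayesian classifiers, CDM, is proposed. Experiments were carried out with Bayesian classifiers on two multi-class text datasets. The results show that the proposed CDM metric achieves better feature selection than the other feature evaluation metrics.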

Key words (translated from the Chinese): text classification, feature selection, text preprocessing, Naïve Bayes
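
The abstract describes a filter-style pipeline: score every term with a feature evaluation metric, keep the highest-scoring terms, and train a Naïve Bayesian classifier on the reduced vocabulary. The sketch below illustrates that pipeline on a toy multi-class corpus. The abstract does not state the formula for CDM, so the score used here (a sum over classes of the absolute log-ratio between a term's smoothed document frequency inside and outside each class) is only an illustrative stand-in and should not be read as the paper's definition. All function names, the smoothing parameter `alpha`, and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def class_log_ratio_scores(docs, labels, alpha=1.0):
    """Illustrative score (an assumption, not the paper's stated CDM formula):
    sum over classes c of |log(P(t|c) / P(t|not c))|, where each probability
    is a smoothed fraction of documents that contain term t."""
    classes = sorted(set(labels))
    n_docs = Counter(labels)
    df = defaultdict(Counter)  # df[c][t] = number of class-c documents containing t
    vocab = set()
    for tokens, c in zip(docs, labels):
        terms = set(tokens)
        vocab |= terms
        df[c].update(terms)
    total = len(docs)
    scores = {}
    for t in vocab:
        df_all = sum(df[c][t] for c in classes)
        s = 0.0
        for c in classes:
            p_in = (df[c][t] + alpha) / (n_docs[c] + 2 * alpha)
            p_out = (df_all - df[c][t] + alpha) / (total - n_docs[c] + 2 * alpha)
            s += abs(math.log(p_in / p_out))
        scores[t] = s
    return scores

def select_top_k(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

def train_nb(docs, labels, vocab, alpha=1.0):
    """Multinomial Naive Bayes restricted to the selected vocabulary."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for tokens, c in zip(docs, labels):
        counts[c].update(t for t in tokens if t in vocab)
    loglik = {}
    for c in classes:
        denom = sum(counts[c].values()) + alpha * len(vocab)
        loglik[c] = {t: math.log((counts[c][t] + alpha) / denom) for t in vocab}
    return prior, loglik

def predict_nb(model, tokens):
    """Return the class with the highest log-posterior; unselected terms are ignored."""
    prior, loglik = model
    return max(
        prior,
        key=lambda c: prior[c] + sum(loglik[c][t] for t in tokens if t in loglik[c]),
    )

# Toy three-class corpus (hypothetical data, for illustration only).
docs = [
    ["goal", "match", "team", "the"],
    ["team", "score", "goal", "the"],
    ["stock", "market", "price", "the"],
    ["market", "trade", "stock", "the"],
    ["virus", "vaccine", "health", "the"],
    ["health", "doctor", "vaccine", "the"],
]
labels = ["sport", "sport", "finance", "finance", "health", "health"]

scores = class_log_ratio_scores(docs, labels)
selected = select_top_k(scores, 6)
model = train_nb(docs, labels, set(selected))
print(selected)
print(predict_nb(model, ["the", "stock", "price"]))
```

In this toy corpus the term "the" occurs in every document, so it receives the lowest score and is dropped, while terms concentrated in a single class are kept. Because the score is tied to how strongly a term's presence shifts the class-conditional probabilities, it reflects the abstract's point that a metric should match the classifier it feeds.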