Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (27): 119-122.


Study on information gain-based feature selection in Chinese text categorization

GUO Yawei, LIU Xiaoxia   

  1. College of Information Science & Technology, Northwest University, Xi’an 710127, China
  • Online: 2012-09-21 Published: 2012-09-24

文本分类中信息增益特征选择方法的研究

郭亚维,刘晓霞   

  1. 西北大学 信息科学与技术学院,西安 710127

Abstract: The traditional Information Gain (IG) feature selection method ignores how a feature is distributed within a class and between classes. To address this shortcoming, intra-class dispersion and inter-class concentration are introduced to identify features that are strongly correlated with a class. Traditional IG also does not combine positive and negative features well, so a proportional factor is introduced to balance the information contributed when a feature appears and when it does not appear. This reduces the influence of negative features on corpora with unevenly distributed categories and improves classification performance. Experimental results verify the effectiveness and feasibility of the improved IG approach.

Key words: text categorization, information gain, feature selection, distribution information inside class, concentration information between classes, proportional factor
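
As a rough illustration of the idea in the abstract, the sketch below computes standard information gain for one term and one class from a 2x2 table of document counts, split into a "term present" part and a "term absent" part. The weight `lam` on the absent part is a hypothetical stand-in for a proportional factor; the abstract does not give the paper's actual formula, and the intra-class and inter-class distribution factors are not reproduced here. The function name and parameter names are assumptions made for this example.

```python
import math


def information_gain(n11, n10, n01, n00, lam=1.0):
    """Information gain of a binary term for a binary class.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    lam: hypothetical weight on the term-absent part (an assumption,
         not the paper's formula); lam=1.0 gives standard IG.
    """
    n = n11 + n10 + n01 + n00

    def part(n_tc, n_t, n_c):
        # P(t, c) * log2(P(t, c) / (P(t) * P(c))); an empty cell contributes 0
        if n_tc == 0:
            return 0.0
        return (n_tc / n) * math.log2(n_tc * n / (n_t * n_c))

    n_t, n_not_t = n11 + n10, n01 + n00  # documents with / without the term
    n_c, n_not_c = n11 + n01, n10 + n00  # documents inside / outside the class

    present = part(n11, n_t, n_c) + part(n10, n_t, n_not_c)
    absent = part(n01, n_not_t, n_c) + part(n00, n_not_t, n_not_c)
    return present + lam * absent
```

With `lam` below 1.0, a term whose score comes mostly from its absence in a class is ranked lower, which is one plausible way to limit the influence of negative features on an unbalanced corpus.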

摘要: 分析了传统信息增益(IG)特征选择方法忽略了特征项在类间、类内分布信息的缺点,引入类内分散度、类间集中度等因素,区分与类强相关的特征;针对传统信息增益(IG)特征选择方法没有很好组合正相关特征和负相关特征的问题,引入比例因子来平衡特征出现和不出现时的信息量,降低在不平衡语料集上负相关特征的比例,提高分类效果。通过实验证明了改进的信息增益特征选择方法的有效性和可行性。

关键词: 文本分类, 信息增益, 特征选择, 类内分散度, 类间集中度, 比例因子