Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (4): 113-117. DOI: 10.3778/j.issn.1002-8331.1507-0240

• Big Data and Cloud Computing •

Improved method of IG feature selection based on word frequency distribution

LIU Haifeng, LIU Shousheng, SONG Aling (刘海峰, 刘守生, 宋阿羚)

  1. Institute of Sciences, PLA University of Science and Technology, Nanjing 210007, China
  • Online: 2017-02-15 Published: 2017-05-11

Abstract: Text feature selection is a core technique in text classification. To address the shortcomings of the information gain (IG) model, this paper takes the distribution of feature-term frequencies at different levels of the text collection as its basis and improves the IG model step by step from three angles: the document-based distribution of a feature term within a class, its word-frequency-based distribution within a class, and its word-frequency distribution between classes. The result is an optimized IG feature selection method based on word frequency distribution information. Subsequent text classification experiments verify the effectiveness of the proposed optimized IG model.

Key words: information gain, feature selection, distribution within class, distribution between classes, text categorization
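
For orientation, the baseline that the abstract refers to can be sketched as follows. This is a minimal illustration of the standard document-presence form of information gain, IG(t) = H(C) - P(t) H(C|t) - P(¬t) H(C|¬t). It is not the optimized model proposed in the paper, whose formulas are not given in this record. The function names and the toy corpus are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)


def information_gain(docs, labels, term):
    """Standard IG of a term, using only presence/absence per document."""
    n = len(docs)
    # Class counts among documents that contain / do not contain the term.
    with_t = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    without_t = Counter(lab for doc, lab in zip(docs, labels) if term not in doc)
    n_t = sum(with_t.values())
    n_not = n - n_t
    h_class = entropy(list(Counter(labels).values()))
    h_given_t = entropy(list(with_t.values()))
    h_given_not = entropy(list(without_t.values()))
    return h_class - (n_t / n) * h_given_t - (n_not / n) * h_given_not


# Hypothetical toy corpus: each document is a set of terms.
docs = [{"ball", "goal"}, {"ball", "team"}, {"vote", "law"}, {"vote", "team"}]
labels = ["sports", "sports", "politics", "politics"]

# Rank the vocabulary by IG, highest first.
vocab = sorted(set().union(*docs))
ranking = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
```

In this toy corpus "ball" separates the two classes perfectly and scores 1 bit, while "team" appears once in each class and scores 0. Because the baseline only checks whether a term occurs in a document, it ignores how often the term occurs and how those counts are spread within and between classes, which is the information the abstract's refinements draw on.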