Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (25): 1-4.

• Doctoral Forum •

Improved mutual information method of feature selection in text categorization

LIU Haifeng1, CHEN Qi1, ZHANG Yihao2

  1. Institute of Sciences, PLA University of Science and Technology, Nanjing 210007, China
  2. Institute of Command Automation, PLA University of Science and Technology, Nanjing 210007, China
  • Online: 2012-09-01  Published: 2012-08-30

Abstract: This paper proposes an optimized Mutual Information (MI) method for text feature selection. It addresses the shortcomings of the MI model in three ways. First, weight factors distinguish positively correlated features from negatively correlated ones. Second, a correction factor introduces term-frequency information into MI to suppress low-frequency words. Third, features are weighted according to their position in the text. Together these changes improve the feature selection efficiency of the MI model. Text classification experiments confirm that the proposed method is reasonable and effective.

Key words: Text Categorization (TC), feature selection, Mutual Information (MI), feature reduction
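
The abstract names three adjustments to MI but gives no formulas, so the following Python sketch is only one possible reading and not the authors' method. The names `ALPHA`, `BETA`, `POSITION_WEIGHT`, `SMOOTH`, `weighted_tf` and `improved_mi`, the additive smoothing, and the `log(1 + frequency)` correction factor are all assumptions made for illustration.

```python
import math

# Hypothetical parameters; none of these values come from the paper.
ALPHA, BETA = 1.0, 0.5                         # weights for positive / negative correlation
POSITION_WEIGHT = {"title": 3.0, "body": 1.0}  # assumed position weights
SMOOTH = 0.5                                   # additive smoothing to avoid log(0)


def weighted_tf(doc, term):
    """Position-weighted count of `term` in one document."""
    return sum(POSITION_WEIGHT.get(field, 1.0) * tokens.count(term)
               for field, tokens in doc["fields"].items())


def improved_mi(docs, term, label):
    """Illustrative MI score with sign weighting, frequency and position factors."""
    n = len(docs)
    n_c = sum(1 for d in docs if d["label"] == label)   # documents in the class
    if n_c == 0:
        raise ValueError("no documents carry this label")
    tfs = [weighted_tf(d, term) for d in docs]
    n_t = sum(1 for tf in tfs if tf > 0)                # documents containing the term
    a = sum(1 for d, tf in zip(docs, tfs)
            if tf > 0 and d["label"] == label)          # in class and containing the term

    # Classic pointwise MI, log(P(t, c) / (P(t) P(c))), with additive smoothing.
    mi = math.log((a + SMOOTH) * n / ((n_t + SMOOTH) * n_c))

    # (1) Treat positive and negative correlation with different weights.
    sign_weight = ALPHA if mi >= 0 else BETA
    # (2) + (3) Frequency correction built from position-weighted counts,
    # so rare terms get a small factor and title occurrences count more.
    freq_factor = math.log(1.0 + sum(tfs))
    return sign_weight * mi * freq_factor


docs = [
    {"label": "sports", "fields": {"title": ["football"], "body": ["football", "match", "goal"]}},
    {"label": "sports", "fields": {"title": ["match"], "body": ["football", "team"]}},
    {"label": "finance", "fields": {"title": ["stock"], "body": ["market", "stock", "team"]}},
    {"label": "finance", "fields": {"title": ["market"], "body": ["stock", "price"]}},
]

for term in ("football", "team"):
    print(term, round(improved_mi(docs, term, "sports"), 3))
```

In this toy corpus "football" appears only in sports documents, including one title, so it scores higher for the sports class than "team", which is split evenly between the two classes. Its score for the finance class is negative and scaled down by `BETA`.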