Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (27): 90-93.

Mixed spam feature selection approach based on information gain

YAN Qiao1, LENG Chengchao2

  1. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
  2. College of Information Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
  • Online:2012-09-21 Published:2012-09-24

基于信息增益的混合垃圾邮件特征选择方法

闫  巧1,冷成朝2   

  1. 深圳大学 计算机与软件学院,广东 深圳 518060
  2. 深圳大学 信息工程学院,广东 深圳 518060

Abstract: Feature selection is a crucial step in spam filtering. The quality of the selected features affects not only classification accuracy but also the cost of training and running the classifier. Popular feature selection methods, namely CHI selection, information gain (IG), mutual information (MI) and SVM feature selection, are compared for spam filtering. These methods only rank features and ignore the redundancy among them. To overcome this shortcoming, a mixed email feature selection method based on information gain is proposed, which uses the conditional probability between feature words and their classification discrimination to reduce redundancy among features. Experimental results show that the new method is promising and improves the accuracy of spam classification.

Key words: feature selection, CHI, Information Gain (IG), Support Vector Machine (SVM)
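
The idea summarized in the abstract, ranking by information gain and then removing redundant features, can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the exact formulas, so the function names, the greedy filtering rule, and the `threshold` parameter below are all assumptions made for illustration, and the "classification discrimination" criterion is not reproduced. Messages are assumed to be sets of tokens, with label 1 for spam and 0 for legitimate mail.

```python
import math


def entropy(pos, neg):
    """Entropy in bits of a two-class split with `pos` and `neg` messages."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the message'."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    h_class = entropy(sum(labels), n - sum(labels))
    h_cond = 0.0
    for part in (with_term, without):
        if part:
            h_cond += len(part) / n * entropy(sum(part), len(part) - sum(part))
    return h_class - h_cond


def cond_prob(docs, a, b):
    """Estimate P(a occurs | b occurs) from message co-occurrence counts."""
    n_b = sum(1 for d in docs if b in d)
    n_ab = sum(1 for d in docs if a in d and b in d)
    return n_ab / n_b if n_b else 0.0


def select_features(docs, labels, k, threshold=0.9):
    """Rank terms by IG, then greedily skip near-duplicates of kept terms.

    Assumed redundancy rule (hypothetical): a term is redundant if it and an
    already kept term each occur in at least `threshold` of the other's
    messages.
    """
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    kept = []
    for t in ranked:
        redundant = any(
            min(cond_prob(docs, t, s), cond_prob(docs, s, t)) >= threshold
            for s in kept
        )
        if not redundant:
            kept.append(t)
        if len(kept) == k:
            break
    return kept
```

In this sketch, a pure ranking would return the top k terms even if several of them always appear together; the filtering step keeps only one term from each such group, which is the kind of redundancy the abstract says ranking-only methods ignore.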

摘要: 特征选择是邮件过滤重要的环节,特征的好坏不仅影响分类的准确率,还直接影响到分类器训练和分类的开销。比较了常用的CHI选择、互信息(MI)、信息增益(IG)和SVM 特征选择算法在垃圾邮件过滤中的效果,针对这些方法只排序而未消除特征间冗余的缺点,提出了利用特征词间条件概率和分类区分度消除冗余的混合邮件特征选择方法。实验结果表明:方法效果良好,提高了邮件分类准确率。

关键词: 特征选择, 卡方检验(CHI), 信息增益(IG), 支持向量机(SVM)