Computer Engineering and Applications, 2014, Vol. 50, Issue (22): 170-174.

• Databases, Data Mining, Machine Learning •

Application of QBC active sampling learning to online spam filtering

CHEN Nian1,2, TANG Zhenmin2

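  1. Department of Mathematics and Computer Science, Chizhou College, Chizhou, Anhui 247000
  2. College of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094
  • Online: 2014-11-15 Published: 2014-11-13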

Online spam filtering method based on the QBC active sampling learning algorithm

CHEN Nian1,2, TANG Zhenmin2   

  1. Department of Mathematics and Computer Science, Chizhou College, Chizhou, Anhui 247000, China
  2. College of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
  • Online: 2014-11-15 Published: 2014-11-13

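Abstract: Targeting the practical application of online spam filtering, this paper builds on sampling-based learning with the query-by-committee (QBC) algorithm and proposes a method that dynamically raises the sampling threshold and acquires highly informative training samples from the unlabeled sample pool in a stepwise manner. While keeping recognition accuracy stable, the method further reduces the number of samples needed for labeling and learning and cuts the associated time cost. Simulations on the Spambase dataset from UCI demonstrate that the method improves learning efficiency.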

Key words: spam filtering, version space, active learning, vote entropy, query-by-committee algorithm

Abstract: This paper proposes a method that obtains informative samples from an unlabeled sample pool in a stepwise manner. Built on the query-by-committee (QBC) algorithm, the method dynamically raises the sampling threshold to address the problem of online spam filtering. It further reduces the number of samples needed for labeling and training while keeping classifier accuracy stable. Experiments on the Spambase dataset demonstrate that the method improves learning efficiency.

Key words: spam filtering, version space, active learning, vote entropy, query-by-committee algorithm
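
The abstract gives no formulas or parameter values, so the following is only a minimal sketch of the general idea: score each unlabeled sample by the vote entropy of a committee, query the samples whose score reaches a threshold, and raise that threshold after each round. The function names, the `start`, `step` and `rounds` parameters, the linear threshold schedule, and the `oracle` and `retrain` callbacks are all assumptions made for illustration; they are not taken from the paper. The cap of 1.0 assumes binary labels (spam or not spam), for which vote entropy in bits cannot exceed 1.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Vote entropy (in bits) of the committee's labels for one sample."""
    k = len(votes)
    return -sum((n / k) * math.log2(n / k) for n in Counter(votes).values())


def select_informative(pool, committee, threshold):
    """Return the pool samples whose vote entropy reaches the threshold."""
    return [x for x in pool
            if vote_entropy([member(x) for member in committee]) >= threshold]


def stepped_sampling(pool, committee, oracle, retrain,
                     start=0.5, step=0.2, rounds=3):
    """Hypothetical stepped schedule: raise the threshold after every round.

    committee: list of callables mapping a sample to a label.
    oracle:    callable returning the true label of a queried sample.
    retrain:   callable (committee, labeled) -> new committee.
    """
    labeled, threshold = [], start
    remaining = list(pool)
    for _ in range(rounds):
        chosen = select_informative(remaining, committee, threshold)
        labeled += [(x, oracle(x)) for x in chosen]       # query labels
        remaining = [x for x in remaining if x not in chosen]
        committee = retrain(committee, labeled)           # update members
        threshold = min(1.0, threshold + step)            # assumed linear step
    return labeled, committee
```

A caller would supply its own classifiers as the committee members and its own retraining routine. Samples on which the committee agrees unanimously have zero vote entropy and are never sent for labeling, which is where the labeling savings come from in this kind of scheme.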