Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (16): 21-25.

• Doctoral Forum •

Incremental Bayesian spam filtering method with data smoothing

王祖辉, 姜维

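  1. Research Center of Information Management and Information System, Harbin Institute of Technology, Harbin 150001
  • Publication date: 2012-06-01 Release date: 2012-06-01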

Approach of improving incremental Bayes based spam filter by data smoothing

WANG Zuhui, JIANG Wei   

  1. Research Center of Information Management and Information System, Harbin Institute of Technology, Harbin 150001, China
  • Online:2012-06-01 Published:2012-06-01

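Abstract: Naïve Bayes classifiers often suffer from data sparsity when applied to spam filtering. Because feature occurrences in a corpus follow Zipf's law, simply adding more training data can hardly solve the problem. To overcome data sparsity, a data smoothing algorithm is introduced to compute compensation probabilities for features that are missing from the Bayesian model. Domain term extraction and a concept relatedness model are used to strengthen the handling of semantic knowledge during classification. An incremental learning method is adopted to carry out dynamic online learning. Experiments on the Ling-Spam spam corpus show that the method improves classification precision by 2.51%, and experiments on the National 863 corpus show that it outperforms the Laplace rule by 3.05%.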

Keywords: spam filtering, Bayesian classification, data smoothing

Abstract: When applied to the spam filtering task, Naïve Bayes often suffers from the sparse data problem. Moreover, this problem can hardly be solved by expanding the corpus, since the distribution of features in a corpus follows Zipf's law. This paper addresses the problem in three ways. First, a smoothing algorithm is embedded into Naïve Bayes to estimate the compensation probability of unseen features. Second, domain term extraction and semantic knowledge are introduced into the spam filter model to improve semantic processing. Third, an incremental learning method is introduced to perform iterative learning. The experimental corpus is Ling-Spam, and the open test results show that this method increases precision by 2.51%. In addition, the experiment in the National 863 Evaluation on Text Classification shows that Naïve Bayes with the Good-Turing algorithm performs 3.05% better than with the Laplace rule.

Key words: spam filter, Naïve Bayes classification, data smoothing
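
The abstract names two mechanisms without giving formulas: a Good-Turing style compensation probability for unseen features, and incremental (online) updating of the Bayesian model. The Python sketch below illustrates how the two might fit together. It is not the authors' implementation. It assumes a fixed vocabulary size (`vocab_size`), uses only the simplest Good-Turing quantity (the unseen probability mass N1/N, where N1 is the number of tokens seen exactly once in a class and N is the total token count), and rescales the maximum-likelihood estimates of seen tokens by 1 - N1/N instead of re-estimating every count. The class name `IncrementalSmoothedNB`, the `floor` clamp, and the toy messages are all hypothetical. Domain term extraction and the concept relatedness model are left out.

```python
import math
from collections import Counter, defaultdict


class IncrementalSmoothedNB:
    """Naive Bayes with a Good-Turing unseen-mass estimate and online updates."""

    def __init__(self, vocab_size, floor=1e-3):
        self.vocab_size = vocab_size              # assumed known in advance
        self.floor = floor                        # assumed clamp, keeps logs finite
        self.doc_counts = Counter()               # label -> number of messages
        self.token_counts = defaultdict(Counter)  # label -> token -> count

    def update(self, tokens, label):
        """Incremental step: fold one labelled message into the counts."""
        self.doc_counts[label] += 1
        self.token_counts[label].update(tokens)

    def unseen_mass(self, label):
        """Good-Turing estimate N1/N of the total probability of unseen tokens."""
        counts = self.token_counts[label]
        total = sum(counts.values())
        if total == 0:
            return 1.0 - self.floor
        n1 = sum(1 for c in counts.values() if c == 1)
        return min(max(n1 / total, self.floor), 1.0 - self.floor)

    def token_prob(self, token, label):
        """P(token | label): discounted MLE if seen, shared unseen mass otherwise."""
        counts = self.token_counts[label]
        p0 = self.unseen_mass(label)
        if counts[token] > 0:
            return (1.0 - p0) * counts[token] / sum(counts.values())
        unseen_types = max(self.vocab_size - len(counts), 1)
        return p0 / unseen_types

    def classify(self, tokens):
        """Return the label with the highest log posterior."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)
            score += sum(math.log(self.token_prob(t, label)) for t in tokens)
            scores[label] = score
        return max(scores, key=scores.get)


nb = IncrementalSmoothedNB(vocab_size=10)
nb.update(["free", "money", "free", "offer"], "spam")
nb.update(["meeting", "agenda", "meeting", "notes"], "ham")
print(nb.classify(["meeting", "lottery"]))  # "lottery" is unseen in both classes

nb.update(["lottery", "free"], "spam")      # online update with a new message
print(nb.classify(["lottery"]))
```

Because each class reserves N1/N of its probability mass for unseen tokens, a message containing a word that never appeared in training still gets a finite score. Under the Laplace rule the same effect comes from adding one to every count, which shifts a fixed amount of mass regardless of how many rare tokens the class actually contains.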