Approach of improving incremental Bayes based spam filter by data smoothing

Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (16): 21-25.

WANG Zuhui, JIANG Wei

Research Center of Information Management and Information System, Harbin Institute of Technology, Harbin 150001, China

Online: 2012-06-01 Published: 2012-06-01

Abstract

Abstract: When applied to spam filtering, the Naïve Bayes classifier often suffers from data sparseness. Because feature occurrences in a corpus follow Zipf's law, simply enlarging the training corpus is unlikely to solve the problem. This paper takes three steps to alleviate it. First, a smoothing algorithm is embedded in the Naïve Bayes model to estimate a compensation probability for unseen features. Second, domain term extraction and a concept-relevance model are introduced into the spam filter to strengthen its handling of semantic knowledge. Third, an incremental learning method is adopted to carry out dynamic online learning. Open-test experiments on the Ling-Spam corpus show that the method increases precision by 2.51%. In addition, experiments on the National 863 Evaluation on Text Classification corpus show that Naïve Bayes with Good-Turing smoothing scores 3.05% higher than with Laplace smoothing.

Key words: spam filter, Naïve Bayes classification, data smoothing
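The abstract names the ingredients (Naïve Bayes, a Good-Turing style compensation probability for unseen features, Laplace as the baseline, incremental updates) but not the formulas, so the following is only a minimal sketch of how they might fit together, not the authors' implementation. The class name `IncrementalNB`, the labels `"spam"`/`"ham"`, and the simplified Good-Turing rule (reserve the mass N1/N for unseen words and rescale the seen counts, falling back to add-one when N1 is 0 or equals N) are all assumptions made for illustration. Domain term extraction and the concept-relevance model are not covered.

```python
import math
from collections import Counter


class IncrementalNB:
    """Multinomial Naive Bayes whose count tables are updated one message at a time.

    Hypothetical illustration; smoothing is "good_turing" (simplified) or "laplace".
    """

    def __init__(self, smoothing="good_turing"):
        self.smoothing = smoothing
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def update(self, tokens, label):
        """Incremental step: fold one labelled message into the counts."""
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def _log_prob(self, word, label):
        counts = self.word_counts[label]
        total = sum(counts.values())            # N: tokens seen in this class
        vocab_size = len(self.vocab)
        c = counts[word]                        # 0 for unseen words
        n1 = sum(1 for k in counts.values() if k == 1)  # N1: singleton types

        # Laplace (add-one) baseline, with one extra slot for out-of-vocabulary
        # words; also used as a fallback when N1/N would be 0 or 1.
        if self.smoothing == "laplace" or n1 == 0 or n1 == total:
            return math.log((c + 1) / (total + vocab_size + 1))

        # Simplified Good-Turing (assumption): total unseen mass p0 = N1 / N,
        # shared evenly by words not yet seen in this class.
        p0 = n1 / total
        if c == 0:
            unseen_types = vocab_size - len(counts) + 1  # +1 out-of-vocabulary slot
            return math.log(p0 / unseen_types)
        return math.log((1 - p0) * c / total)

    def log_scores(self, tokens):
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.word_counts:
            prior = (self.doc_counts[label] + 1) / (n_docs + 2)
            scores[label] = math.log(prior) + sum(
                self._log_prob(w, label) for w in tokens
            )
        return scores

    def classify(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    nb = IncrementalNB()
    nb.update("win cash prize now".split(), "spam")
    nb.update("cheap cash offer win".split(), "spam")
    nb.update("meeting agenda for monday".split(), "ham")
    nb.update("lunch meeting on monday".split(), "ham")
    print(nb.classify("win cash".split()))
```

Because `update` only touches count tables, newly labelled mail can be folded in at any time without retraining from scratch. With toy counts like these the reserved unseen mass is large; the estimate is better suited to corpora where singletons are a small fraction of all tokens.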

WANG Zuhui, JIANG Wei. Approach of improving incremental Bayes based spam filter by data smoothing[J]. Computer Engineering and Applications, 2012, 48(16): 21-25.

