LZ complexity similarity based spam detection

Computer Engineering and Applications, 2007, Vol. 43, Issue 29: 176-178.

LI Bin¹, LI Yi-bing¹,², HE Hong-bo¹,²

1. School of Information Science and Engineering, Central South University, Changsha 410083, China
2. School of Physics Science and Technology, Central South University, Changsha 410083, China

Online: 2007-10-11 Published: 2007-10-11
Corresponding author: LI Bin

Abstract

A spam detection method is proposed based on the LZ complexity similarity between symbolic sequences, combined with the K nearest neighbor rule. Unlike approaches based on the vector space model, computing the LZ complexity similarity between email texts requires neither text preprocessing nor feature extraction. In addition, the lazy learning characteristic of the K nearest neighbor rule suits application environments in which the set of spam samples must be adjusted dynamically. The proposed method was tested on the Ling-Spam email corpus using 10-fold cross-validation, and its overall detection performance is better than that of several statistical and machine learning methods based on the vector space model.

Key words: spam, LZ complexity similarity, K nearest neighbor rule
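
To make the idea in the abstract concrete, the following is a minimal Python sketch of classifying raw email text with an LZ-complexity-based distance and a K nearest neighbor vote. The abstract does not state the paper's exact similarity formula, so everything here is an assumption for illustration: `lz_complexity` counts phrases in a Lempel-Ziv (1976 style) exhaustive parsing, `lz_distance` uses one plausible normalized form, max(c(xy) - c(x), c(yx) - c(y)) / max(c(x), c(y)), and the names `lz_distance` and `knn_classify` as well as the toy messages are hypothetical.

```python
from collections import Counter


def lz_complexity(s: str) -> int:
    """Number of phrases in a Lempel-Ziv (1976 style) parsing of s.

    Each phrase is the shortest substring starting at position i that
    does not already occur in the text preceding its last character.
    A trailing phrase that never becomes "new" is still counted once.
    """
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the phrase while it can be copied from earlier text.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def lz_distance(x: str, y: str) -> float:
    """Assumed normalized LZ distance; smaller means more similar.

    This formula is an illustrative choice, not taken from the paper.
    """
    cx, cy = lz_complexity(x), lz_complexity(y)
    if max(cx, cy) == 0:
        return 0.0  # both strings are empty
    gain_xy = lz_complexity(x + y) - cx  # new phrases y adds after x
    gain_yx = lz_complexity(y + x) - cy  # new phrases x adds after y
    return max(gain_xy, gain_yx) / max(cx, cy)


def knn_classify(query: str, labeled: list[tuple[str, str]], k: int = 3) -> str:
    """Majority vote among the k labeled texts closest to query.

    Ties go to the label seen first among the nearest neighbors.
    """
    nearest = sorted(labeled, key=lambda item: lz_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical toy samples; a real experiment would use a corpus
    # such as Ling-Spam.
    samples = [
        ("win a free prize now, click here to claim your free prize", "spam"),
        ("cheap loans, free money, click here now", "spam"),
        ("the syntax workshop deadline has been extended to friday", "ham"),
        ("please find the minutes of the linguistics seminar attached", "ham"),
    ]
    print(knn_classify("click here for a free prize", samples, k=3))
```

Because the K nearest neighbor rule stores samples instead of fitting a model, updating the filter in this sketch amounts to appending or removing `(text, label)` pairs in the list, which mirrors the dynamic-adjustment point made in the abstract. The distance works directly on raw strings, so no tokenization or feature vector is built. The parsing as written is quadratic or worse in text length and is meant for illustration only.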

LI Bin, LI Yi-bing, HE Hong-bo. LZ complexity similarity based spam detection[J]. Computer Engineering and Applications, 2007, 43(29): 176-178.
