Computer Engineering and Applications ›› 2007, Vol. 43 ›› Issue (29): 176-178.

• Database and Information Processing •

LZ complexity similarity based spam detection

LI Bin1, LI Yi-bing1,2, HE Hong-bo1,2

  1. School of Information Science and Engineering, Central South University, Changsha 410083, China
  2. School of Physics Science and Technology, Central South University, Changsha 410083, China
  • Online: 2007-10-11 Published: 2007-10-11
  • Contact: LI Bin

Abstract: A spam detection method is proposed that combines the LZ complexity similarity between symbol sequences with the K-nearest-neighbor rule. Unlike approaches based on the vector space model, computing the LZ complexity similarity between email texts requires neither text preprocessing nor feature extraction. In addition, the lazy-learning nature of the K-nearest-neighbor rule suits application environments in which the set of spam samples must be adjusted dynamically. The method was evaluated on the Ling-Spam corpus using 10-fold cross-validation, and its overall detection performance is better than that of several statistical and machine learning methods based on the vector space model.

Key words: spam, LZ complexity similarity, K-nearest-neighbor rule
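
The abstract does not give the exact similarity formula, so the following Python sketch only illustrates the general idea. It assumes the Lempel-Ziv (1976) exhaustive parsing as the complexity measure and, as a stand-in for the paper's similarity, one published normalized LZ distance (Otu and Sayood, 2003). The function names `lz_complexity`, `lz_distance`, and `knn_classify`, and the default `k=3`, are hypothetical and not taken from the paper.

```python
from collections import Counter


def lz_complexity(s: str) -> int:
    """Count the components in the Lempel-Ziv (1976) exhaustive parsing of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the candidate while it can still be copied from earlier text
        # (the search window excludes the candidate's own last character).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def lz_distance(s: str, q: str) -> float:
    """Assumed stand-in measure: normalized LZ distance (Otu and Sayood, 2003)."""
    cs, cq = lz_complexity(s), lz_complexity(q)
    if max(cs, cq) == 0:
        return 0.0  # both texts are empty
    # How many extra components each text needs once the other is known.
    extra = max(lz_complexity(s + q) - cs, lz_complexity(q + s) - cq)
    return extra / max(cs, cq)


def knn_classify(text, labeled_emails, k=3):
    """Majority vote among the k labeled emails closest to text.

    labeled_emails is a list of (raw_text, label) pairs; no tokenization
    or feature extraction is applied to the raw text.
    """
    ranked = sorted(labeled_emails, key=lambda pair: lz_distance(text, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because the classifier keeps no trained model, updating the spam sample set in this sketch amounts to appending or removing `(text, label)` pairs in `labeled_emails`, which reflects the lazy-learning property the abstract highlights. The parsing here is written for readability rather than speed and would be slow on long messages.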