Computer Engineering and Applications ›› 2007, Vol. 43 ›› Issue (25): 128-132.

• Network, Communication and Security •

Simplified Chinese spam mail filter: design and performance evaluation

LI Wei-jie, XU Yong

  1. Bio-Computing Research Center, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, Guangdong 518005, China
  • Online: 2007-09-01 Published: 2007-09-01
  • Contact: LI Wei-jie

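Chinese title: 简体中文垃圾邮件分类的实验设计及对比研究 (Experimental design and comparative study of Simplified Chinese spam mail classification)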

李维杰 (LI Wei-jie), 徐勇 (XU Yong)

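  1. Bio-Computing Research Center, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, Guangdong 518005, China
  • Corresponding author: 李维杰 (LI Wei-jie)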

Abstract: This paper analyzes the approaches to and methods for filtering unsolicited bulk e-mail, also known as spam. After examining keyword-based methods and statistical learning methods, it proposes a new method that combines the two. Three pattern-classification techniques are applied to spam filtering: naïve Bayesian decision theory, nearest-neighbor classification, and linear classification based on the perceptron criterion function. The feature set used by all three classifiers is selected by mutual information. A comparison of the three classifiers shows their respective advantages and disadvantages. The paper also points out a promising way to filter spam using mutual information.

Key words: spam mail, classification, Bayesian decision, nearest-neighbor decision, perceptron criterion function

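Abstract (translated from Chinese): The technical approaches and methods of spam filtering are analyzed comprehensively. Building on an analysis of keyword-based methods and statistical methods, a technical approach is proposed that combines the two and applies classification methods from pattern recognition, namely Bayesian, nearest-neighbor, and perceptron classifiers, to spam filtering. Using a feature set selected by the mutual information maximization criterion, a comparative analysis of the classification techniques reveals the strengths and weaknesses of the Bayesian, nearest-neighbor, and perceptron classifiers in spam filtering. The paper also offers useful ideas for applying the mutual information maximization criterion to spam filtering.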

Key words (translated from Chinese): spam mail, classifier, Bayesian, nearest neighbor, perceptron
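
The pipeline outlined in the abstract (features chosen by mutual information, then Bayesian, nearest-neighbor, and perceptron classifiers compared on the same features) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. Everything specific in it is an assumption: the function names, the binary term-presence representation, the add-one smoothing, the Hamming distance, the fixed-increment perceptron update, and the six invented toy messages, which stand in for already word-segmented mail and are not data from the paper.

```python
import math
from collections import Counter

def mutual_information(docs, labels, term):
    """MI (in nats) between presence of `term` and the class label."""
    n = len(docs)
    joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
    term_counts, label_counts = Counter(), Counter()
    for (present, label), count in joint.items():
        term_counts[present] += count
        label_counts[label] += count
    return sum(
        (count / n) * math.log(count * n / (term_counts[present] * label_counts[label]))
        for (present, label), count in joint.items()
    )

def select_features(docs, labels, k):
    """Keep the k terms with the highest mutual information."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: mutual_information(docs, labels, t), reverse=True)
    return ranked[:k]

def vectorize(features, doc):
    """Binary vector: 1 if the feature term occurs in the message, else 0."""
    return [1 if term in doc else 0 for term in features]

def train_naive_bayes(X, y):
    """Bernoulli naive Bayes with add-one smoothing (an assumed variant)."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        probs = [(sum(col) + 1) / (len(rows) + 2) for col in zip(*rows)]
        model[c] = (len(rows) / len(X), probs)
    return model

def predict_naive_bayes(model, x):
    def log_posterior(c):
        prior, probs = model[c]
        return math.log(prior) + sum(math.log(p if xi else 1 - p) for xi, p in zip(x, probs))
    return max(model, key=log_posterior)

def predict_nearest_neighbor(X, y, x):
    """1-nearest-neighbor under Hamming distance (an assumed metric)."""
    def distance(a):
        return sum(ai != xi for ai, xi in zip(a, x))
    return min(zip(X, y), key=lambda pair: distance(pair[0]))[1]

def train_perceptron(X, y, epochs=20):
    """Fixed-increment perceptron rule; the last weight is the bias."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for x, label in zip(X, y):
            target = 1 if label == "spam" else -1
            xa = x + [1]
            # Update only on misclassified (or zero-score) samples.
            if target * sum(wi * xi for wi, xi in zip(w, xa)) <= 0:
                w = [wi + target * xi for wi, xi in zip(w, xa)]
    return w

def predict_perceptron(w, x):
    score = sum(wi * xi for wi, xi in zip(w, x + [1]))
    return "spam" if score > 0 else "ham"

# Invented toy messages, each represented as a set of tokens.
docs = [
    {"free", "prize", "click"}, {"free", "offer", "click"}, {"prize", "offer", "win"},
    {"meeting", "report", "schedule"}, {"report", "project", "schedule"},
    {"meeting", "project", "lunch"},
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

features = select_features(docs, labels, k=8)
X = [vectorize(features, d) for d in docs]
nb_model = train_naive_bayes(X, labels)
weights = train_perceptron(X, labels)

for message in ({"free", "click", "win"}, {"meeting", "report"}):
    x = vectorize(features, message)
    print(sorted(message),
          predict_naive_bayes(nb_model, x),
          predict_nearest_neighbor(X, labels, x),
          predict_perceptron(weights, x))
```

Sharing one feature vector across the three classifiers mirrors the comparison described in the abstract: any difference in their predictions then comes from the decision rule rather than from the features.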