Computer Engineering and Applications ›› 2009, Vol. 45 ›› Issue (25): 121-124. DOI: 10.3778/j.issn.1002-8331.2009.25.037

• Database and Information Processing •

LDA-based feature selection for spam filtering

YUAN Bo-qiu, ZHOU Yi-min, LI Lin

  1. School of Computer Science and Technology, Beijing University of Aeronautics and Astronautics, Beijing 100083, China
  • Received: 2008-10-22 Revised: 2009-01-08 Online: 2009-09-01 Published: 2009-09-01
  • Contact: YUAN Bo-qiu

Abstract: Spam filtering is a long-standing research problem, and a growing number of text categorization techniques have been adapted to it. Latent Dirichlet Allocation (LDA) and related topic models are attracting increasing attention in summarization, manifold discovery, and other applications involving discrete data. This paper introduces LDA into spam filtering as a feature selection method. Combining LDA-based feature selection with a simple centroid-based + kNN classifier yields a test spam filter. Preliminary experimental results show that the features selected by LDA outperform the baseline features selected by information gain (IG) and mutual information (MI), and that the test filter performs comparably to other filters.

Key words: spam filtering, Latent Dirichlet Allocation (LDA), feature selection
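
As a rough illustration of the kind of pipeline the abstract describes (feature selection followed by a centroid-based classifier), the Python sketch below scores terms with information gain (IG), one of the baselines named in the abstract, and classifies messages with a nearest-centroid rule. It is not the authors' implementation. The abstract does not specify the LDA-based selection step or the kNN stage, so both are omitted. The toy corpus, the function names, and the use of cosine similarity are assumptions made purely for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG of the binary event 'term occurs in the document' w.r.t. the class."""
    classes = sorted(set(labels))
    with_term = [l for d, l in zip(docs, labels) if term in d]
    without_term = [l for d, l in zip(docs, labels) if term not in d]

    def class_entropy(subset):
        return entropy([subset.count(c) for c in classes])

    n = len(labels)
    return (class_entropy(labels)
            - len(with_term) / n * class_entropy(with_term)
            - len(without_term) / n * class_entropy(without_term))


def select_features(docs, labels, k):
    """Return the k terms with the highest IG (ties broken alphabetically)."""
    vocab = sorted({t for d in docs for t in d})
    return sorted(vocab, key=lambda t: -information_gain(docs, labels, t))[:k]


def vectorize(doc, features):
    """Term-count vector restricted to the selected features."""
    counts = Counter(doc)
    return [counts[f] for f in features]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train_centroids(docs, labels, features):
    """One mean vector per class, computed over the selected features."""
    centroids = {}
    for c in set(labels):
        vectors = [vectorize(d, features) for d, l in zip(docs, labels) if l == c]
        centroids[c] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return centroids


def classify(doc, centroids, features):
    """Assign the class whose centroid is most similar to the document."""
    v = vectorize(doc, features)
    return max(sorted(centroids), key=lambda c: cosine(v, centroids[c]))


# Hypothetical toy corpus: tokenized messages with spam/ham labels.
docs = [
    ["cheap", "pills", "buy", "now"],
    ["buy", "cheap", "watches"],
    ["meeting", "agenda", "monday"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]

features = select_features(docs, labels, k=3)
centroids = train_centroids(docs, labels, features)
print(features)
print(classify(["cheap", "offer", "buy"], centroids, features))
```

In a setup closer to the one described, the IG ranking would be replaced by a ranking derived from a fitted LDA model, and the centroid decision would be combined with a kNN vote. How the paper does either is not stated in the abstract.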
