Computer Engineering and Applications ›› 2009, Vol. 45 ›› Issue (16): 159-161. DOI: 10.3778/j.issn.1002-8331.2009.16.046

• Databases and Information Processing •

词间相关性在贝叶斯文本分类中的应用研究

章舜仲1,2,王树梅1,黄河燕3,陈肇雄3   

  1. 南京理工大学 计算机科学系,南京 210094
  2. 南京财经大学 电子商务系,南京 210046
  3. 中国科学院 计算机语言信息工程研究中心,北京 100083
  • 收稿日期:2008-04-01 修回日期:2008-06-11 出版日期:2009-06-01 发布日期:2009-06-01
  • 通讯作者: 章舜仲

Research on application of word correlation in Naive Bayes text classification

ZHANG Shun-zhong1,2,WANG Shu-mei1,HUANG He-yan3,CHEN Zhao-xiong3   

  1. Department of Computer Science, Nanjing University of Science and Technology, Nanjing 210094, China
  2. Department of Electronic Business, Nanjing University of Finance and Economics, Nanjing 210046, China
  3. Computer Language Information Engineering Research Center, Chinese Academy of Sciences, Beijing 100083, China
  • Received: 2008-04-01 Revised: 2008-06-11 Online: 2009-06-01 Published: 2009-06-01
  • Contact: ZHANG Shun-zhong

摘要: 针对朴素贝叶斯分类的属性独立性假设的不足,讨论了相关性及多变量相关的概念,给出词间相关度的定义。在TAN分类器的词间相关性分析基础上,提出一种文档特征词相关度估计公式及其在改进朴素贝叶斯分类模型中应用的算法,在Reuters-21578文本数据集上的实验表明,改进算法简单易行,能有效改进贝叶斯分类性能。

关键词: 文本分类, 朴素贝叶斯, 事件相关, 相关度, 树扩展型朴素贝叶斯分类器

Abstract: To address the limitation of the attribute independence assumption in Naive Bayes classification, this paper discusses the concepts of correlation and multivariate correlation and defines the degree of correlation between words. Building on an analysis of inter-word correlation in the TAN classifier, it proposes a formula for estimating the correlation degree of a document's feature words, together with an algorithm that applies this estimate in an improved Naive Bayes classification model. Experiments on the Reuters-21578 text collection show that the improved algorithm is simple to implement and effectively improves Bayesian classification performance.

Key words: text classification, Naive Bayes, event correlation, correlation degree, Tree Augmented Naive Bayes(TAN) classifier