Computer Engineering and Applications, 2007, Vol. 43, Issue (35): 156-158.

• Database and Information Processing •

Improved TFIDF feature selection algorithm based on information entropy

ZHOU Yan-tao (1,2), TANG Jian-bo (1), WANG Jia-qin (1)

  1. College of Electrical and Information Engineering, Hunan University, Changsha 410082, China
  2. Information and Electrical Engineering College, Naval Engineering University, Wuhan 430033, China
  • Online: 2007-12-11 Published: 2007-12-11
  • Corresponding author: ZHOU Yan-tao

Abstract: Feature selection has a strong effect on the accuracy of text categorization. Traditional TFIDF does not take into account how a feature term is distributed across classes. To address this shortcoming, the paper analyzes the TFIDF feature selection algorithm in depth and, drawing on the concept of information entropy, proposes a new TFIDF feature selection algorithm. Experimental results show that the improved algorithm effectively raises the accuracy of text categorization.

Key words: term information entropy, feature selection, TFIDF, data mining
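
The abstract does not state the improved weighting formula, so the following is only a minimal sketch of the general idea: a term that is spread evenly across classes (high entropy) is less useful for categorization than a term concentrated in one class (low entropy). The function names, the toy corpus, and in particular the `(1 - normalized entropy)` scaling are assumptions chosen for illustration, not the authors' method.

```python
import math
from collections import Counter

def class_entropy(term, docs, labels):
    """Shannon entropy (bits) of how occurrences of `term` spread over classes."""
    counts = Counter()
    for tokens, label in zip(docs, labels):
        counts[label] += tokens.count(term)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def tfidf(term, tokens, docs):
    """Classic TF-IDF: relative term frequency times log(N / df)."""
    tf = tokens.count(term) / len(tokens)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0

def entropy_weighted_tfidf(term, tokens, docs, labels):
    """Hypothetical combination: scale TF-IDF by (1 - normalized class entropy)."""
    n_classes = len(set(labels))
    max_h = math.log2(n_classes) if n_classes > 1 else 1.0
    penalty = 1.0 - class_entropy(term, docs, labels) / max_h
    return tfidf(term, tokens, docs) * penalty

# Toy corpus (made up for this sketch): "goal" occurs only in sports
# documents, while "team" occurs in both classes.
docs = [["goal", "match", "team"], ["goal", "team", "coach"],
        ["stock", "market", "team"], ["stock", "bank", "market"]]
labels = ["sports", "sports", "finance", "finance"]

for term in ("goal", "team"):
    print(term, entropy_weighted_tfidf(term, docs[0], docs, labels))
```

In this toy corpus "goal" has zero class entropy and keeps its full TF-IDF weight, whereas "team" is penalized because it appears in both classes. Plain TF-IDF alone cannot make this distinction, since it only sees document counts and not class labels.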