Computer Engineering and Applications ›› 2013, Vol. 49 ›› Issue (10): 140-146.


Research on term weighting algorithm based on information entropy theory

GUO Hongyu   

  1. North China Institute of Computer Technology, Beijing 100083, China
  • Online: 2013-05-15 Published: 2013-05-14

Research on a feature weighting algorithm based on information entropy theory (title translated from the Chinese)

郭红钰   

  1. North China Institute of Computer Technology, Beijing 100083

Abstract: Text representation is an essential step in text categorization, and the choice of representation method strongly affects the final classification accuracy. To overcome the limitations of the traditional term weighting algorithm TFIDF (Term Frequency and Inverse Document Frequency), this paper proposes ETFIDF (Entropy-based TFIDF), a new term weighting algorithm based on information entropy theory. ETFIDF considers not only how often a term occurs in a document and how many documents in the training set contain the term, but also how the documents containing the term are distributed within the training set. Experimental results show that ETFIDF outperforms TFIDF in text categorization. The paper also presents a detailed theoretical analysis and an experimental study of the relationship between ETFIDF and feature selection. The results show that taking the distribution of the documents in which a term occurs into account during the text representation stage represents the text more accurately, and that, when both accuracy and efficiency are considered, combining ETFIDF with a feature selection algorithm achieves higher performance.

Key words: information entropy, term weighting, feature selection, text categorization
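
The abstract describes the idea behind ETFIDF but does not give its formula. The following Python sketch only illustrates that idea and is not the paper's definition: it computes a plain TFIDF weight and then scales it by a hypothetical factor, one minus the normalized entropy of the term's document distribution over category labels. The function names, the use of category labels to measure the distribution, and the scaling factor are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tfidf_weights(docs):
    """Plain TFIDF: raw term count times log(N / document frequency).

    docs is a list of token lists; returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def class_entropy(docs, labels):
    """Normalized entropy (0..1) of each term's document counts over categories.

    0 means the term occurs in one category only; 1 means it is spread evenly.
    """
    per_class = defaultdict(Counter)  # term -> {label: number of documents}
    for tokens, label in zip(docs, labels):
        for t in set(tokens):
            per_class[t][label] += 1
    n_classes = len(set(labels))
    entropy = {}
    for t, counts in per_class.items():
        total = sum(counts.values())
        h = -sum((c / total) * math.log(c / total) for c in counts.values())
        entropy[t] = h / math.log(n_classes) if n_classes > 1 else 0.0
    return entropy


def entropy_tfidf_weights(docs, labels):
    """Hypothetical entropy-scaled TFIDF (an assumption, not the paper's ETFIDF)."""
    entropy = class_entropy(docs, labels)
    return [
        {t: w * (1.0 - entropy[t]) for t, w in doc.items()}
        for doc in tfidf_weights(docs)
    ]


if __name__ == "__main__":
    docs = [["ball", "goal", "team"], ["ball", "team", "win"],
            ["vote", "law", "team"], ["vote", "law", "win"]]
    labels = ["sport", "sport", "politics", "politics"]
    for doc_weights in entropy_tfidf_weights(docs, labels):
        print(doc_weights)
```

In this sketch a term that appears in only one category keeps its full TFIDF weight, while a term spread evenly over all categories is driven toward zero. Other ways of combining the entropy term with TFIDF are equally plausible.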

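Abstract (translated from the Chinese): Text representation is an indispensable step when classification algorithms are applied to text, and the choice of text representation method plays a crucial role in the final classification accuracy. To address the shortcomings of the classic feature weighting method TFIDF (Term Frequency and Inverse Document Frequency), a feature weighting algorithm based on information entropy theory, ETFIDF (Entropy based TFIDF), is proposed. ETFIDF considers not only the frequency of a feature term in a document and the concentration of that term in the training set, but also the dispersion of the term across the categories. Experimental results show that computing feature weights with ETFIDF effectively improves text classification performance. The relationship between ETFIDF and feature selection is also analyzed theoretically and studied experimentally in some detail. The experimental results show that considering the relationship between features and categories in the text representation stage represents text more accurately, and that, when both accuracy and efficiency are taken into account, using ETFIDF together with a feature selection algorithm gives better classification results.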

Keywords (translated from the Chinese): information entropy, feature weighting, feature selection, text classification