Computer Engineering and Applications ›› 2018, Vol. 54 ›› Issue (21): 133-140. DOI: 10.3778/j.issn.1002-8331.1707-0329

• Pattern Recognition and Artificial Intelligence •

Research on a feature selection algorithm combining improved CHI and RFFS

QIU Ningjia, ZHOU Wen, WANG Peng, TAO Yue

  1. College of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022, China
  • Online: 2018-11-01  Published: 2018-10-30

Abstract: The traditional CHI algorithm ignores the term frequency of feature words, so important feature words are easily missed during selection. Drawing on the speed of Filter algorithms and the high accuracy of Wrapper algorithms in feature selection, this paper proposes a feature selection algorithm that combines an improved CHI algorithm (TDF-CHI) with Random Forest Feature Selection (RFFS). First, TDF-CHI selects features and removes redundant ones by computing how strongly both the document frequency and the term frequency of each feature word correlate with the category. RFFS then measures the importance of the remaining features and performs a second round of feature selection, optimizing the feature set so that classifier performance improves further. To verify the improved algorithm, it is tested on news text data with commonly used classifiers. The experiments show that the feature words selected by the improved algorithm yield better classification results than those selected by the traditional CHI algorithm, raising both the accuracy and the recall of the classifier.

Key words: feature selection, TDF-CHI, Random Forest Feature Selection (RFFS), text classification
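
The limitation of traditional CHI noted in the abstract can be illustrated with a minimal sketch. It computes the standard chi-square statistic of a term for a category, which depends only on how many documents contain the term, and compares it with the term's raw frequency. The toy corpus and the function names `chi_square` and `term_frequency` are assumptions made for illustration; they are not taken from the paper, and the TDF-CHI formula itself is not reproduced here.

```python
from collections import Counter


def chi_square(docs, labels, term, category):
    """Standard CHI score of `term` for `category`, from document counts only.

    A: docs in the category that contain the term
    B: docs outside the category that contain the term
    C: docs in the category that lack the term
    D: docs outside the category that lack the term
    CHI = N * (A*D - B*C)^2 / ((A+C) * (B+D) * (A+B) * (C+D))
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def term_frequency(docs, labels, term, category):
    """Total number of occurrences of `term` in documents of `category`."""
    return sum(Counter(tokens)[term]
               for tokens, label in zip(docs, labels) if label == category)


# Hypothetical toy corpus: tokenized documents with their class labels.
docs = [
    ["goal", "goal", "goal", "match", "team"],
    ["goal", "goal", "match", "coach"],
    ["stock", "market", "team"],
    ["stock", "bank", "coach"],
]
labels = ["sports", "sports", "finance", "finance"]

for term in ("goal", "match"):
    print(term,
          chi_square(docs, labels, term, "sports"),
          term_frequency(docs, labels, term, "sports"))
```

In this toy corpus, "goal" and "match" appear in exactly the same documents, so they receive the same CHI score even though "goal" occurs far more often in the sports class. A score that also accounts for term frequency, which is what the abstract says TDF-CHI does, could separate the two.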