Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (17): 147-156. DOI: 10.3778/j.issn.1002-8331.2008-0171

• Pattern Recognition and Artificial Intelligence •

Random Forest Feature Selection Algorithm Based on Categorization Information and Application

WU Weijie, ZHANG Jingxiang

  1. School of Science, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online: 2021-09-01 Published: 2021-08-30

Abstract:

To address the high computational cost of the traditional random forest as the number of features grows, a random forest multi-feature permutation algorithm is proposed. The features are first clustered. Then, one cluster at a time, all features in that cluster are randomly permuted simultaneously while the other clusters are left unchanged, which yields an importance score for every feature cluster and a ranking among the clusters. Within each cluster, features are ranked by their correlation with the classification information, and a correlation threshold is used to select the important features. The remaining features are ranked first between clusters, then within clusters. To further assess the effectiveness of the method, three random forest multi-feature permutation feature selection algorithms are designed, based on K-means, hierarchical, and fuzzy C-means clustering. Experimental results show that, compared with the traditional random forest method, the new algorithms still achieve high classification accuracy when fewer features are selected, and they take less time.

Key words: feature selection, clustering, random forest, multi-feature permutation
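
The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: the trained random forest is stood in for by any callable `predict(rows)`, importance is measured as the drop in accuracy on a labelled sample, all features in a cluster share one row shuffle, correlation is taken as absolute Pearson correlation with a numeric class label, and features passing the threshold are listed in the same cluster-then-correlation order as the rest. The names `cluster_permutation_importance`, `split_and_rank`, and `toy_predict` are hypothetical.

```python
import random
from statistics import mean


def accuracy(predict, X, y):
    """Fraction of rows whose predicted label equals the true label."""
    return mean(1.0 if p == t else 0.0 for p, t in zip(predict(X), y))


def cluster_permutation_importance(predict, X, y, clusters, n_repeats=5, seed=0):
    """Score each feature cluster by the mean accuracy drop observed when all
    of its columns are shuffled together and every other column is unchanged."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    scores = {}
    for name, cols in clusters.items():
        drops = []
        for _ in range(n_repeats):
            order = list(range(len(X)))
            rng.shuffle(order)
            X_perm = [list(row) for row in X]
            # Assumption: one shared row order permutes the whole cluster.
            for i, src in enumerate(order):
                for c in cols:
                    X_perm[i][c] = X[src][c]
            drops.append(baseline - accuracy(predict, X_perm, y))
        scores[name] = mean(drops)
    return scores


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return abs(cov) / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def split_and_rank(scores, X, y, clusters, threshold):
    """Order features between clusters (by cluster score) and then within
    clusters (by correlation with the label), and split them at the threshold."""
    corr = {c: abs_pearson([row[c] for row in X], y)
            for cols in clusters.values() for c in cols}
    cluster_order = sorted(clusters, key=lambda k: scores[k], reverse=True)
    ordered = [c for k in cluster_order
               for c in sorted(clusters[k], key=lambda c: corr[c], reverse=True)]
    selected = [c for c in ordered if corr[c] >= threshold]
    remaining = [c for c in ordered if corr[c] < threshold]
    return selected, remaining


# Toy data: column 0 equals the label, column 1 agrees with it on 16 of 20
# rows, and column 2 is uncorrelated with it.
X = [[float(i % 2), float(i % 2 if i < 16 else 1 - i % 2), float((i // 2) % 2)]
     for i in range(20)]
y = [i % 2 for i in range(20)]


def toy_predict(rows):
    """Hypothetical stand-in for a fitted random forest; reads column 0 only."""
    return [1 if r[0] > 0.5 else 0 for r in rows]


clusters = {"noise": [2], "signal": [0, 1]}  # assumed output of a clustering step
scores = cluster_permutation_importance(toy_predict, X, y, clusters)
selected, remaining = split_and_rank(scores, X, y, clusters, threshold=0.8)
```

Because one evaluation pass is made per cluster rather than per feature, the number of permutation rounds scales with the number of clusters, which is consistent with the time saving the abstract reports. In practice the `clusters` mapping would come from K-means, hierarchical, or fuzzy C-means clustering of the feature columns.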