Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (11): 95-102. DOI: 10.3778/j.issn.1002-8331.2005-0101

• Big Data and Cloud Computing •

Hybrid Filter and Improved Adaptive GA for Feature Selection

QIU Yunfei, GAO Huacong

  1. School of Software, Liaoning Technical University, Huludao, Liaoning 125100, China
  • Published: 2021-06-01 Online: 2021-05-31

Abstract:

To address the curse of dimensionality and the overfitting that arise in feature selection on high-dimensional, small-sample data, this paper proposes ReFS-AGA, a feature selection method that combines the Filter mode and the Wrapper mode. First, the ReliefF algorithm and normalized mutual information are combined to evaluate feature relevance and quickly screen out the important features. Then, an improved adaptive genetic algorithm is applied, which introduces an optimal strategy to balance feature diversity. With the twin objectives of minimizing the number of features and maximizing classification accuracy, a new evaluation function is designed that uses the number of features as an adjustment term, so that the optimal feature subset is obtained efficiently during iterative evolution. Different classification algorithms are then used to classify the reduced feature subsets on gene expression data. The experimental results show that the method effectively eliminates irrelevant features and improves the efficiency of feature selection. Compared with the ReliefF algorithm and the two-stage feature selection algorithm mRMR-GA, it raises the average classification accuracy by 11.18 and 4.04 percentage points, respectively, while obtaining the smallest feature subset dimension.
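
As a rough illustration of the Filter stage, the sketch below ranks discrete features by their normalized mutual information with the class labels. It is not the authors' implementation: the abstract does not state which normalization is used or how the score is combined with ReliefF, so the normalization 2·I(X;Y)/(H(X)+H(Y)) and the helper names `entropy`, `normalized_mutual_information`, and `rank_features` are assumptions made for this example. Continuous gene expression values would also need to be discretized first, which is not shown.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def normalized_mutual_information(x, y):
    """Assumed normalization: NMI = 2 * I(x; y) / (H(x) + H(y))."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0.0:
        return 0.0  # both sequences are constant, so there is nothing to share
    h_xy = entropy(list(zip(x, y)))   # joint entropy H(x, y)
    mutual_info = h_x + h_y - h_xy    # I(x; y) = H(x) + H(y) - H(x, y)
    return 2.0 * mutual_info / (h_x + h_y)


def rank_features(columns, labels):
    """Return feature indices sorted by NMI with the labels, highest first."""
    scores = [normalized_mutual_information(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
```

Calling `rank_features(columns, labels)` with one list per feature gives an ordering from which the top-scoring features could be passed on to the Wrapper stage.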

Key words: feature selection, Filter mode, ReliefF algorithm, normalized mutual information, adaptive genetic algorithm
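
The evaluation function described in the abstract trades classification accuracy against the number of selected features. The abstract does not give its exact form, so the following is only a minimal sketch under an assumed linear penalty; the function name `fitness` and the parameter `penalty_weight` are hypothetical and are not taken from the paper.

```python
def fitness(mask, accuracy, penalty_weight=0.1):
    """Score a candidate feature subset encoded as a 0/1 mask.

    Assumed form: accuracy minus a penalty proportional to the share of
    features kept. This is an illustrative choice, not the paper's formula.
    """
    n_total = len(mask)
    n_selected = sum(mask)
    if n_selected == 0:
        return 0.0  # an empty subset cannot be used to classify
    return accuracy - penalty_weight * n_selected / n_total
```

Under this assumed form, two subsets with equal accuracy are ordered by size, so a genetic algorithm that maximizes the score is pushed toward smaller subsets.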