Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (2): 158-169. DOI: 10.3778/j.issn.1002-8331.2403-0478

• Theory, Research and Development •

Multi-Stage Hybrid Feature Selection Algorithm for Imbalanced Medical Data

LIU Jiaxuan, LI Daiwei, REN Lijuan, ZHANG Haiqing, CHEN Jinjing, YANG Rui   

  1. College of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China
  2. Sichuan Province Engineering Technology Research Center of Support Software of Informatization Application, Chengdu 610225, China
  • Online: 2025-01-15 Published: 2025-01-15

面向不平衡医疗数据的多阶段混合特征选择算法

刘佳璇,李代伟,任李娟,张海清,陈金京,杨瑞   

Abstract: To address the high dimensionality and class imbalance of medical data, this paper proposes HFSIM (hybrid feature selection for imbalanced medical data), a multi-stage hybrid feature selection algorithm built on SFE (simple, fast and effective high-dimensional feature selection). HFSIM applies an improved adaptive-boundary SMOTE oversampling technique, which generates new minority-class instances that satisfy the boundary conditions, to counter the class imbalance. Meanwhile, to improve the insufficient diversity of the search space, the non-selection operator rate parameter UR (unselected rate) of SFE is optimized, which helps the algorithm avoid premature convergence and entrapment in local optima. Finally, the filter-based Fisher Score method is combined with the UR-optimized algorithm, so that good search capability is obtained at a lower computational cost. In experiments, HFSIM reaches an accuracy of 99.67% on the Ovarian dataset, 2.11 percentage points higher than SFE, and improves G-means and F1 by 5.13 and 2.30 percentage points, respectively. A comparison of the number of selected features and the running time further indicates that HFSIM maintains accuracy while reducing computational cost.

Key words: high-dimensional imbalance, feature selection, multi-stage hybrid, medical data
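
As a rough illustration of two standard building blocks named in the abstract, the sketch below computes a textbook Fisher Score per feature (the kind of filter criterion the abstract refers to) and the G-mean of a binary classifier. It is not the HFSIM implementation: the function names `fisher_scores` and `g_mean` and the toy data are assumptions made for illustration, and the paper's adaptive-boundary SMOTE and UR optimization are not reproduced here.

```python
import math


def fisher_scores(X, y):
    """Textbook Fisher Score per feature: between-class scatter of the
    class means divided by the pooled within-class variance."""
    n_features = len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        overall_mean = sum(column) / len(column)
        between = 0.0
        within = 0.0
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            n_c = len(values)
            mean_c = sum(values) / n_c
            var_c = sum((v - mean_c) ** 2 for v in values) / n_c
            between += n_c * (mean_c - overall_mean) ** 2
            within += n_c * var_c
        # Guard against division by zero when a feature is constant within every class.
        if within == 0.0:
            scores.append(math.inf if between > 0.0 else 0.0)
        else:
            scores.append(between / within)
    return scores


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.1, 5.0], [0.2, 1.0], [0.0, 3.0], [0.3, 4.0], [0.9, 2.0], [1.1, 4.5]]
y = [0, 0, 0, 0, 1, 1]
scores = fisher_scores(X, y)
# Feature indices ordered from most to least discriminative.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

In this toy setting a classifier that always predicts the majority class would be 67% accurate yet have a G-mean of 0, which is one reason G-mean is commonly reported alongside accuracy on imbalanced data.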
