Computer Engineering and Applications, 2024, Vol. 60, Issue (14): 74-85. DOI: 10.3778/j.issn.1002-8331.2306-0248

• Theory, Research and Development •

Feature Selection Using Adaptive Neighborhood and Clustering for Imbalanced Data

SUN Lin, LIANG Na, WANG Xinya   

  1. College of Artificial Intelligence, Tianjin University of Science and Technology, Tianjin 300457, China
  2. College of Computer and Information Engineering, Henan Normal University, Xinxiang, Henan 453007, China
  3. Henan Zhongyu Construction Investment Group Company Ltd., Zhengzhou 450000, China
  • Online: 2024-07-15 Published: 2024-07-15

Abstract: Traditional neighborhood rough sets do not consider the class distribution of imbalanced data, most neighborhood systems rely on manual tuning that rarely finds the optimal neighborhood radius, and clustering usually requires the number of clusters to be specified in advance. To address these problems, a feature selection method for imbalanced data based on adaptive neighborhood and clustering is proposed. Firstly, the adaptive k-nearest neighbors and shared nearest neighbors of each sample are determined from the average distance between that sample and the other samples under each feature; an adaptive neighborhood density is then defined and a hybrid sampling model is designed to construct a balanced decision system. Secondly, a new neighborhood radius is defined based on the feature distribution, and the Gaussian kernel function is used to describe the fuzzy similarity relation between samples within a neighborhood. Fuzzy neighborhood mutual information is used to measure the correlation between features, and the features are clustered on this basis. Finally, a particle swarm initialization strategy is constructed based on fuzzy neighborhood mutual information, and a dynamic bit mask strategy and a differential perturbation operator suited to integer encoding are introduced to improve the integer particle swarm optimization algorithm, which selects representative features from the feature clusters to form the final feature subset. Experimental results on 19 imbalanced datasets show that the proposed algorithm effectively improves classification performance on imbalanced data.

Keywords: adaptive neighborhood, hybrid sampling, fuzzy neighborhood mutual information, feature clustering, feature selection
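
The abstract names its building blocks without giving formulas, so the following is only a rough sketch of two of them under stated assumptions, not the authors' definitions. It assumes that a sample's adaptive neighbors are the samples lying closer than its own mean distance to all other samples, that shared nearest neighbors are the intersection of two such neighbor sets, and that fuzzy similarity is a Gaussian kernel truncated at a given neighborhood radius. The function names and the `radius` and `sigma` parameters are hypothetical.

```python
import math


def adaptive_neighbors(samples):
    """Return, for each sample, the indices of samples closer than its mean distance.

    Assumption: the mean distance to all other samples acts as a per-sample
    threshold, so the number of neighbors adapts to the local layout.
    Requires at least two samples.
    """
    n = len(samples)
    neighbors = []
    for i in range(n):
        dists = [math.dist(samples[i], samples[j]) for j in range(n)]
        mean_d = sum(dists) / (n - 1)  # dists[i] is 0, so this averages over the others
        neighbors.append([j for j in range(n) if j != i and dists[j] < mean_d])
    return neighbors


def shared_neighbors(neighbors, i, j):
    """Indices that appear in the neighbor lists of both sample i and sample j."""
    return sorted(set(neighbors[i]) & set(neighbors[j]))


def fuzzy_similarity(x, y, radius, sigma=1.0):
    """Gaussian-kernel similarity, set to 0 outside the neighborhood radius.

    Assumption: `radius` and `sigma` are free parameters here; the paper
    derives its radius from the feature distribution instead.
    """
    d = math.dist(x, y)
    if d > radius:
        return 0.0
    return math.exp(-(d * d) / (2.0 * sigma * sigma))


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
    nb = adaptive_neighbors(points)
    print(nb)
    print(shared_neighbors(nb, 0, 3))
    print(fuzzy_similarity(points[0], points[1], radius=2.0))
```

In this toy input the outlying point still receives neighbors because its threshold is its own mean distance; a density measure built on such neighbor sets is one plausible way to decide which samples to oversample or undersample, though the abstract does not specify the rule used.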