Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (23): 106-112. DOI: 10.3778/j.issn.1002-8331.2011-0215

• Big Data and Cloud Computing •

Over-Sampling Method on Imbalanced Data Based on WKMeans and SMOTE

CHEN Junfeng, ZHENG Zhongtuan   

  1. School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai 201620, China
  • Online: 2021-12-01 Published: 2021-12-02

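An Over-Sampling Method for Imbalanced Data Combining WKMeans and SMOTE

陈俊丰 (CHEN Junfeng), 郑中团 (ZHENG Zhongtuan)

  1. School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai 201620, China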

Abstract:

A new over-sampling method for imbalanced data sets, WKMeans-SMOTE, is proposed based on feature weighting and clustering ensembles. It addresses a weakness of the SMOTE method, which synthesizes new samples from all minority samples without any guidance. First, because different feature weights affect the clustering results to different degrees, a feature-weighted clustering algorithm is chosen, and the initial cluster centers are changed several times to generate different clustering results. Then, the clustering results are aligned using a cluster-matching algorithm, and the minority samples on the cluster boundary are selected by introducing a clustering consistency index. Finally, SMOTE is applied to the selected minority samples, and the CART algorithm is used as the base classifier, trained on the balanced data set. The experimental results show that the method achieves better classification quality in terms of F-value and G-mean than SMOTE, Borderline-SMOTE, ADASYN, and other over-sampling methods.

Key words: imbalanced data classification, clustering ensembles, feature weighting, clustering consistency index, cluster matching, over-sampling
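
The pipeline summarized in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the feature weights are supplied by the caller rather than learned, cluster labels are aligned by brute-force search over permutations (practical only for small k), the consistency index is taken to be the fraction of runs that agree with a sample's majority-vote cluster, and minority samples whose index falls below a hypothetical `threshold` are treated as boundary samples. All function and parameter names are hypothetical.

```python
import itertools
import numpy as np

def weighted_kmeans(X, k, weights, rng, n_iter=50):
    """Lloyd-style k-means under a feature-weighted squared Euclidean distance."""
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # (n_samples, k) matrix of weighted squared distances to each center
        d = (((X[:, None, :] - centers[None, :, :]) ** 2) * weights).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def align_labels(reference, labels, k):
    """Relabel `labels` with the permutation that agrees most with `reference`."""
    best = max(itertools.permutations(range(k)),
               key=lambda p: np.sum(np.asarray(p)[labels] == reference))
    return np.asarray(best)[labels]

def consistency_index(runs, k):
    """Fraction of runs that agree with each sample's majority-vote cluster."""
    runs = np.asarray(runs)                                   # (n_runs, n_samples)
    votes = np.stack([(runs == j).sum(axis=0) for j in range(k)])
    return votes.max(axis=0) / runs.shape[0]

def smote(seeds, minority, n_new, rng, k_neighbors=5):
    """Interpolate between seed samples and their nearest minority neighbours."""
    k_nn = min(k_neighbors, len(minority) - 1)
    out = []
    for _ in range(n_new):
        x = seeds[rng.integers(len(seeds))]
        dist = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(dist)[1:k_nn + 1]   # index 0 is x itself
        nb = minority[rng.choice(neighbours)]
        out.append(x + rng.random() * (nb - x))
    return np.array(out)

def wkmeans_smote_sketch(X, y, minority_label, weights, k=2,
                         n_runs=10, threshold=1.0, seed=0):
    """Cluster several times, flag inconsistent minority samples, over-sample them."""
    rng = np.random.default_rng(seed)
    runs = [weighted_kmeans(X, k, weights, rng) for _ in range(n_runs)]
    runs = [runs[0]] + [align_labels(runs[0], r, k) for r in runs[1:]]
    ci = consistency_index(runs, k)
    is_min = (y == minority_label)
    minority = X[is_min]
    boundary = X[is_min & (ci < threshold)]         # assumed selection rule
    seeds = boundary if len(boundary) > 0 else minority  # fall back if none flagged
    n_new = int((~is_min).sum()) - len(minority)    # samples needed to balance
    if n_new <= 0:
        return X, y
    X_new = smote(seeds, minority, n_new, rng)
    return np.vstack([X, X_new]), np.concatenate([y, np.full(n_new, minority_label)])
```

The balanced arrays returned by `wkmeans_smote_sketch` could then be passed to any decision-tree learner; the abstract names CART as the base classifier, which is left out here to keep the sketch dependency-free.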

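Abstract (translated from the Chinese):

To address the drawback that SMOTE over-samples all minority samples, an over-sampling method based on feature weighting and clustering ensembles (WKMeans-SMOTE) is proposed for classifying imbalanced data. Because different feature weights affect the clustering results to different degrees, a feature-weighted clustering algorithm is used to cluster the original data set, and the initial cluster centers are changed several times to generate different clustering results. The different clustering results are matched using a cluster-label matching method, and a "clustering consistency coefficient" is introduced to select the samples located on the minority-class boundary. SMOTE over-sampling is applied to the selected minority samples, and the CART decision tree is used as the base classifier, trained on the new minority samples together with all majority samples. Experimental results show that, compared with existing over-sampling methods such as SMOTE, Borderline-SMOTE, and ADASYN, the proposed WKMeans-SMOTE method yields some improvement in classification performance.

Key words (translated from the Chinese): imbalanced data classification, clustering ensembles, feature weights, clustering consistency coefficient, cluster matching, over-sampling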
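
For reference, the two evaluation measures named in the English abstract, F-value and G-mean, are commonly defined as follows for a binary problem with the minority class treated as positive. The abstract does not state which variant the authors use (for example, a weighted F-measure), so the forms below are the usual textbook definitions and should be read as an assumption. TP, FP, TN, and FN denote the confusion-matrix counts.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}
\]
\[
\text{F-value} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad
\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}
\]
```

Both measures drop sharply when the minority class is poorly recognized, which is why they are preferred over plain accuracy on imbalanced data.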