Over-Sampling Method on Imbalanced Data Based on WKMeans and SMOTE

doi:10.3778/j.issn.1002-8331.2011-0215

Abstract

A new over-sampling method for imbalanced data sets, WKMeans-SMOTE, is proposed based on feature weighting and clustering ensembles. It addresses a weakness of SMOTE, which synthesizes new samples from all minority samples without any guidance. First, because different feature weights affect the clustering results to different degrees, a feature-weighted clustering algorithm is chosen to cluster the original data set, and the initial cluster centers are changed repeatedly to generate different clustering results. Then, the clustering results are aligned using a cluster-matching algorithm, and the minority samples on the cluster boundary are selected by introducing a clustering consistency index. Finally, SMOTE is applied to the selected minority samples, and the CART algorithm is used as the base classifier, trained on the balanced data set. Experimental results show that the method achieves better F-value and G-mean than SMOTE, Borderline-SMOTE, ADASYN and other over-sampling methods.
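
The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the cluster labels from several runs are already available, aligns them to the first run with a simple greedy overlap matching, and takes the consistency index of a sample to be the fraction of runs that agree on its most frequent aligned label. The function names, the toy data, and the threshold value are all hypothetical.

```python
import random
from collections import Counter


def align_labels(reference, labels, k):
    """Relabel `labels` so its clusters best overlap `reference` (greedy matching).

    Assumes both label lists use the integers 0..k-1.
    """
    overlap = Counter(zip(labels, reference))  # (source label, reference label) -> count
    mapping, used = {}, set()
    for (src, dst), _ in overlap.most_common():
        if src not in mapping and dst not in used:
            mapping[src] = dst
            used.add(dst)
    # Clusters that found no partner take whatever reference labels remain.
    free = [c for c in range(k) if c not in used]
    for src in range(k):
        if src not in mapping:
            mapping[src] = free.pop()
    return [mapping[label] for label in labels]


def consistency_index(aligned_runs):
    """Per sample: fraction of runs agreeing on its most frequent label (assumed definition)."""
    n_runs = len(aligned_runs)
    return [Counter(votes).most_common(1)[0][1] / n_runs for votes in zip(*aligned_runs)]


def smote_point(x, neighbor, rng):
    """SMOTE interpolation: a random point on the segment from x to neighbor."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


# Toy data, made up for illustration: 5 samples, the last two are minority.
X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9], [0.6, 0.5]]
is_minority = [False, False, False, True, True]

# Hypothetical labels from three clustering runs with different initial centers.
runs = [[0, 0, 1, 1, 1], [1, 1, 0, 0, 1], [0, 0, 1, 1, 0]]

aligned = [runs[0]] + [align_labels(runs[0], r, k=2) for r in runs[1:]]
ci = consistency_index(aligned)

threshold = 1.0  # assumed cut-off: anything below full agreement counts as boundary
boundary_minority = [i for i, c in enumerate(ci) if is_minority[i] and c < threshold]

# Real SMOTE picks among the k nearest minority neighbors; with only one other
# minority sample here, sample 3 is the only candidate.
rng = random.Random(0)
synthetic = [smote_point(X[i], X[3], rng) for i in boundary_minority]
```

In this toy example only sample 4 changes cluster between runs, so it is the only minority sample selected for interpolation. In practice the label vectors would come from repeated runs of the feature-weighted clustering algorithm, and the resampled data would then be passed to a CART classifier.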

Key words: imbalanced data classification, clustering ensembles, feature weighting, clustering consistency index, clusters matching, over-sampling

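Abstract (translated from the Chinese):

To address the shortcoming that SMOTE over-samples all minority-class samples, an over-sampling method based on feature weighting and clustering ensembles (WKMeans-SMOTE) is proposed for classifying imbalanced data. Because different feature weights affect clustering results to different degrees, a feature-weighted clustering algorithm is used to cluster the original data set, and the initial cluster centers are changed repeatedly to generate different clustering results. The different clustering results are then matched using a cluster-label matching method, and a "clustering consistency index" is introduced to select the samples lying on the minority-class boundary. SMOTE over-sampling is applied to the selected minority samples, and a CART decision tree is used as the base classifier, trained on the new minority samples together with all majority samples. Experimental results show that, compared with existing over-sampling methods such as SMOTE, Borderline-SMOTE and ADASYN, the proposed WKMeans-SMOTE method yields some improvement in classification performance.

Key words (translated from the Chinese): imbalanced data classification, clustering ensembles, feature weights, clustering consistency index, cluster matching, over-sampling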

CHEN Junfeng, ZHENG Zhongtuan. Over-Sampling Method on Imbalanced Data Based on WKMeans and SMOTE[J]. Computer Engineering and Applications, 2021, 57(23): 106-112.

陈俊丰，郑中团. WKMeans与SMOTE结合的不平衡数据过采样方法[J]. 计算机工程与应用, 2021, 57(23): 106-112.
