Imbalanced Classification Method Based on Cross-Class Sample Migration Framework

doi:10.3778/j.issn.1002-8331.2305-0191

Abstract

In imbalanced classification, balancing both the number and the distribution of samples in the class-overlapping region is key to alleviating subsequent decision bias. Existing imbalanced classification methods often generate new samples only from minority-class samples to balance the class sizes, and do not make full use of the rich information in majority-class samples. In particular, when the absolute number of minority samples is too small, the original minority samples alone cannot effectively balance the sample distribution in the overlapping region. This paper proposes an imbalanced classification method based on a cross-class sample migration framework. First, a mapping network built from fully connected layers is embedded in the latent-code sampling process of a variational autoencoder (VAE). Once the VAE has fully learned the commonalities and characteristics of the different classes, the latent codes of majority samples are mapped and transformed under a latent-code prior constraint and a cross-domain consistency constraint, so that the latent codes before and after transformation share the same distribution space and the VAE decoder can migrate majority samples to minority samples. At the same time, a generative adversarial mechanism is introduced to discriminate between original and new samples, as well as between the latent codes before and after transformation, which further improves the reliability of the migrated samples. On this basis, the distances between the newly generated samples and the original samples of each class are subjected to weighted constraints, and the samples closer to the overlapping region are selected by screening, so that the number and distribution of samples of different classes in this region become more balanced. Experimental results on 16 public datasets show that the proposed method significantly outperforms 10 typical imbalanced classification methods in F1-measure and G-mean. The improvement is even more pronounced on 11 public datasets with a high imbalance ratio and a very small absolute number of minority samples.
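The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes a binary problem in which the minority class is the positive class (label 1), and it uses the common definition of G-mean as the geometric mean of sensitivity and specificity; the abstract does not state the exact definitions used in the paper, and the function names here are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the minority class (an assumption)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f1_and_gmean(y_true, y_pred, positive=1):
    """Return (F1, G-mean) for a binary problem.

    F1 is the harmonic mean of precision and recall on the positive class.
    G-mean is sqrt(sensitivity * specificity), the usual binary definition.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Toy data: 2 minority samples (label 1) and 8 majority samples (label 0).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10                      # 80% accuracy, no minority hits
one_hit = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]        # 1 TP, 1 FN, 1 FP, 7 TN

print(f1_and_gmean(y_true, always_majority))
print(f1_and_gmean(y_true, one_hit))
```

A classifier that always predicts the majority class reaches 80% accuracy on this toy data but scores 0 on both metrics, which is why F1 and G-mean, rather than accuracy, are the natural choice for imbalanced problems. The second prediction vector yields an F1 of 0.5 and a G-mean of about 0.66.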

Key words: imbalanced classification, cross-class sample migration framework, variational autoencoder, mapping network, generative adversarial mechanism, weighted Euclidean distance constraint

YU Haibo, LIU Jing, LI Qiangwei, GAO Xin, TAN Huang, CHEN Tianyang. Imbalanced Classification Method Based on Cross-Class Sample Migration Framework[J]. Computer Engineering and Applications, 2024, 60(16): 143-158.

于海波, 刘婧, 李强伟, 高欣, 谭煌, 陈天阳. 跨类别样本迁移框架下的不平衡分类方法[J]. 计算机工程与应用, 2024, 60(16): 143-158.
