Oversampling Method for Imbalanced Data Using Credible Counterfactual

doi:10.3778/j.issn.1002-8331.2211-0413

Abstract

To address the inability of traditional oversampling methods to make full use of the information in a dataset, this paper proposes a counterfactual-based (counterfactual, CF) oversampling method for imbalanced data, and further removes the "incredible" samples from the synthesized minority-class samples. The core idea is to synthesize new samples from the feature values of original instances in the dataset. Compared with the interpolation used by traditional oversampling methods, this approach better exploits the decision-boundary information in the data, providing the classifier with more useful information and improving classification performance. Extensive comparative experiments were carried out on 9 imbalanced datasets from KEEL and UCI, using 5 different classifiers (SVM, DT, Logistic, RF, AdaBoost), against 4 traditional oversampling methods (SMOTE, B1-SMOTE, B2-SMOTE, ADASYN). The results show that the proposed method achieves higher AUC, F1, and G-mean values, and addresses the class imbalance problem more effectively.

Key words: imbalanced data, classifiers, oversampling, counterfactual (CF)
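
The abstract reports F1 and G-mean values without defining them on this page. The sketch below assumes the standard confusion-matrix definitions, with the minority class treated as the positive class. It is only meant to show why these metrics are preferred over plain accuracy on imbalanced data. The function name `f1_and_gmean` and the example counts are hypothetical and are not taken from the paper.

```python
import math

def f1_and_gmean(tp, fn, fp, tn):
    """Return (F1, G-mean) for the positive (minority) class from confusion counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    gmean = math.sqrt(recall * specificity)
    return f1, gmean

# Hypothetical test set: 10 minority and 90 majority samples.
# A classifier that always predicts "majority" is 90% accurate,
# yet both metrics are 0 because no minority sample is found.
always_majority = f1_and_gmean(tp=0, fn=10, fp=0, tn=90)

# A classifier that finds 8 of the 10 minority samples at the cost
# of 9 false alarms scores well above 0 on both metrics.
balanced = f1_and_gmean(tp=8, fn=2, fp=9, tn=81)

print(always_majority)
print(balanced)
```

Because G-mean multiplies the per-class recall rates, it drops to zero whenever one class is ignored entirely, which is the failure mode that oversampling methods try to avoid.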

GAO Feng, SONG Mei, ZHU Yi. Oversampling Method for Imbalanced Data Using Credible Counterfactual[J]. Computer Engineering and Applications, 2024, 60(5): 165-171.

高峰, 宋媚, 祝义. 利用可信反事实的不平衡数据过采样方法[J]. 计算机工程与应用, 2024, 60(5): 165-171.
