Posterior Probability and Density-Based Imbalanced Data Undersampling

doi:10.3778/j.issn.1002-8331.2207-0442

Abstract

Undersampling is one of the most popular methods for dealing with the class imbalance problem. Existing research shows that handling class overlap effectively can improve the performance of oversampling methods. Most current undersampling research, however, holds that the loss of key samples caused by an improper sample selection strategy is the main factor limiting undersampling performance. Researchers have therefore proposed a series of methods for selecting informative majority samples, while the handling of class overlap in undersampling has received little attention. This paper proposes an undersampling method based on Bayes posterior probability and distribution density (BPDDUS), which first detects and cleans samples in overlapping regions and then undersamples the remaining samples according to the distribution information of the majority class. Specifically, the method first uses the Bayes posterior probability to clean potential noise and overlapping samples in the majority class, which sharpens the classification decision boundary. For the cleaned majority samples, global distribution density and information entropy are then introduced to measure the importance of each sample and to assign corresponding sampling weights. Finally, the majority class is undersampled according to these weights and an ensemble classifier is constructed to improve the generalization ability of the model. Numerical experiments on 43 datasets from the KEEL repository verify the effectiveness of the proposed BPDDUS method.

Key words: imbalanced data, undersampling, Bayes posterior probability, global distribution density, ensemble classification, information entropy
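
The three stages named in the abstract (posterior-based cleaning, weighted undersampling, ensemble construction) can be illustrated with a minimal sketch. It is not the BPDDUS implementation: the abstract gives no formulas, so every concrete choice below is an assumption made for illustration. These include the Gaussian naive Bayes estimate of the posterior, the 0.5 cleaning threshold, the inverse-mean-distance stand-in for global distribution density, the binary entropy of the posterior, the way the two are combined into a weight, and the use of naive Bayes as the base learner. All function names are hypothetical.

```python
import math
import random

def fit_gaussian_nb(X, y):
    """Fit a Gaussian naive Bayes model: per-class prior, feature means, feature variances."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        n, d = len(rows), len(rows[0])
        means = [sum(r[j] for r in rows) / n for j in range(d)]
        # A small variance floor avoids division by zero on constant features.
        variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n + 1e-9 for j in range(d)]
        model[label] = (n / len(X), means, variances)
    return model

def posterior(model, x, label):
    """Bayes posterior P(label | x) under the naive Gaussian assumption."""
    log_joint = {}
    for c, (prior, means, variances) in model.items():
        log_lik = sum(-0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
                      for xj, m, v in zip(x, means, variances))
        log_joint[c] = math.log(prior) + log_lik
    top = max(log_joint.values())  # subtract the max for numerical stability
    total = sum(math.exp(v - top) for v in log_joint.values())
    return math.exp(log_joint[label] - top) / total

def clean_majority(X_maj, X_min, threshold=0.5):
    """Stage 1 (assumed rule): drop majority samples whose minority posterior exceeds threshold."""
    model = fit_gaussian_nb(X_maj + X_min, [0] * len(X_maj) + [1] * len(X_min))
    kept = [x for x in X_maj if posterior(model, x, 1) <= threshold]
    return kept, model

def sampling_weights(X_maj, model):
    """Stage 2 (assumed formula): density times (1 + posterior entropy), normalized to sum to 1."""
    weights = []
    for x in X_maj:
        mean_dist = sum(math.dist(x, other) for other in X_maj) / max(len(X_maj) - 1, 1)
        density = 1.0 / (mean_dist + 1e-12)  # stand-in for global distribution density
        p = min(max(posterior(model, x, 1), 1e-12), 1 - 1e-12)
        entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
        weights.append(density * (1.0 + entropy))
    total = sum(weights)
    return [w / total for w in weights]

def weighted_undersample(X_maj, weights, k, rng):
    """Draw k distinct samples, favoring large weights (Efraimidis-Spirakis random keys)."""
    keys = sorted(((rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)), reverse=True)
    return [X_maj[i] for _, i in keys[:k]]

def train_ensemble(X_maj, X_min, n_models=5, seed=0):
    """Stage 3: train one base learner per balanced, weight-sampled subset."""
    rng = random.Random(seed)
    kept, model = clean_majority(X_maj, X_min)
    weights = sampling_weights(kept, model)
    k = min(len(kept), len(X_min))
    ensemble = []
    for _ in range(n_models):
        subset = weighted_undersample(kept, weights, k, rng)
        ensemble.append(fit_gaussian_nb(subset + X_min, [0] * k + [1] * len(X_min)))
    return ensemble

def predict(ensemble, x):
    """Average the members' minority posteriors and threshold at 0.5."""
    score = sum(posterior(m, x, 1) for m in ensemble) / len(ensemble)
    return int(score >= 0.5)

# Toy data: a 48-point majority grid, one majority point placed inside the minority region,
# and six minority points. Labels: 0 = majority, 1 = minority.
X_maj = [[(i % 8) * 0.5, (i // 8) * 0.5] for i in range(48)] + [[5.2, 5.1]]
X_min = [[5.0 + 0.3 * (i % 3), 5.0 + 0.3 * (i // 3)] for i in range(6)]
ensemble = train_ensemble(X_maj, X_min)
```

On this toy data the majority point at `[5.2, 5.1]` sits inside the minority cluster, so the cleaning step discards it before any weights are computed. The sketch assumes both classes are non-empty and that features are numeric.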

摘要： 欠采样是当前解决类不平衡问题的主流方法之一。现有研究表明，高效地处理类别重叠能够有效提升过采样方法的性能。然而，目前对欠采样的研究大多认为由于样本选择策略不当而导致的关键样本丢失是影响欠采样方法性能的主要原因，为此，研究者从不同的角度提出了一系列针对性的方法，但鲜有对欠采样中类别重叠的研究。提出一种融合贝叶斯后验概率和分布密度的欠采样方法（BPDDUS）实现重叠区域样本的检测和清洗，并通过样本的分布信息对清洗后的样本进行欠采样。具体来说，该方法通过贝叶斯后验概率对多数类样本中潜在的噪声和重叠样本进行清洗以增强分类决策边界的清晰度。对清洗后的多数类样本，引入全局分布密度和信息熵来度量样本对不平衡数据分类学习的重要程度并对其分配相应的采样权重。按样本权重欠采样并构建集成分类系统，以提升模型的泛化能力。在43个KEEL数据库数据集上进行的数值实验验证了所提的BPDDUS方法的有效性。

关键词: 不平衡数据, 欠采样, 贝叶斯后验概率, 全局分布密度, 集成分类, 信息熵

REN Yanping, ZHENG Zhong, JIANG Yifei, YAN Yuanting, ZHANG Yanping. Posterior Probability and Density-Based Imbalanced Data Undersampling[J]. Computer Engineering and Applications, 2022, 58(23): 268-277.

任艳平, 郑重, 江一飞, 严远亭, 张燕平. 融合后验概率和密度的不平衡数据欠采样方法[J]. 计算机工程与应用, 2022, 58(23): 268-277.
