Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (23): 268-277. DOI: 10.3778/j.issn.1002-8331.2207-0442

• Engineering and Applications •

Posterior Probability and Density-Based Imbalanced Data Undersampling

REN Yanping,  ZHENG Zhong,  JIANG Yifei,  YAN Yuanting,  ZHANG Yanping   

  1. College of Computer Science and Technology, Anhui University, Hefei 230601, China
  • Online: 2022-12-01 Published: 2022-12-01

An Undersampling Method for Imbalanced Data Fusing Posterior Probability and Density

任艳平,郑  重,江一飞,严远亭,张燕平   

  1. 安徽大学 计算机科学与技术学院,合肥 230601

Abstract: Undersampling is one of the most popular methods for dealing with the class imbalance problem. Existing research shows that handling class overlap efficiently can improve the performance of oversampling methods. However, most current studies of undersampling hold that the loss of key samples caused by an improper sample selection strategy is the main factor limiting the performance of undersampling methods. Researchers have therefore proposed a series of methods for selecting informative majority samples, but the handling of class overlap in undersampling remains an open problem. In this paper, an undersampling method based on Bayes posterior probability and distribution density (BPDDUS) is proposed. It first detects and cleans samples in overlapping areas, and then undersamples the remaining samples according to the distribution information of the majority samples. Specifically, the method first cleans potential noise and overlapping samples in the majority class by Bayes posterior probability to sharpen the classification decision boundary. Global distribution density and information entropy are then introduced to measure the importance of the remaining majority samples and to assign the corresponding sampling weights. Finally, the majority class is undersampled according to these weights and an ensemble classifier is constructed to improve the generalization ability of the model. The validity of the proposed BPDDUS method is verified by numerical experiments on 43 datasets from the KEEL repository.

Key words: imbalanced data, undersampling, Bayes posterior probability, global distribution density, ensemble classification, information entropy
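The three stages named in the abstract (posterior-based cleaning, density- and entropy-based weighting, weighted undersampling into an ensemble) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything concrete here is an assumption made for illustration: the Gaussian kernel likelihoods, the 0.5 cleaning threshold, the additive combination of normalized density and entropy, the nearest-centroid base learner, and all function names. It is not the authors' BPDDUS implementation.

```python
import math
import random

def kernel_density(points, x, bandwidth=1.0):
    """Mean Gaussian kernel value between x and every point (unnormalised density)."""
    total = 0.0
    for p in points:
        d2 = sum((a - b) ** 2 for a, b in zip(p, x))
        total += math.exp(-d2 / (2.0 * bandwidth ** 2))
    return total / len(points)

def minority_posterior(x, X_maj, X_min, bandwidth=1.0):
    """Bayes rule with empirical priors and kernel likelihoods: P(minority | x).
    Intended for x taken from X_maj, so the denominator is never zero."""
    joint_maj = len(X_maj) * kernel_density(X_maj, x, bandwidth)
    joint_min = len(X_min) * kernel_density(X_min, x, bandwidth)
    return joint_min / (joint_min + joint_maj)

def binary_entropy(p):
    """Entropy (in bits) of the two-class posterior (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def clean_and_weight(X_maj, X_min, bandwidth=1.0, threshold=0.5):
    """Stage 1: drop majority samples that look like minority (assumed threshold).
    Stage 2: weight the survivors by density plus entropy (assumed combination)."""
    posts = [minority_posterior(x, X_maj, X_min, bandwidth) for x in X_maj]
    kept = [(x, p) for x, p in zip(X_maj, posts) if p < threshold]
    X_clean = [x for x, _ in kept]
    dens = [kernel_density(X_clean, x, bandwidth) for x in X_clean]
    ents = [binary_entropy(p) for _, p in kept]
    max_d = max(dens)
    max_e = max(ents) or 1.0  # avoid dividing by zero if every entropy is 0
    weights = [d / max_d + e / max_e for d, e in zip(dens, ents)]
    return X_clean, weights

def weighted_subset(items, weights, k, rng):
    """Weighted sampling without replacement using Efraimidis-Spirakis keys."""
    keyed = sorted(zip(items, weights),
                   key=lambda iw: rng.random() ** (1.0 / iw[1]),
                   reverse=True)
    return [item for item, _ in keyed[:k]]

def centroid(X):
    """Component-wise mean of a list of points."""
    return [sum(col) / len(X) for col in zip(*X)]

def fit_ensemble(X_maj, X_min, n_members=5, seed=0):
    """Stage 3: one nearest-centroid learner per balanced, weighted subset."""
    rng = random.Random(seed)
    X_clean, weights = clean_and_weight(X_maj, X_min)
    k = min(len(X_min), len(X_clean))
    c_min = centroid(X_min)
    return [(centroid(weighted_subset(X_clean, weights, k, rng)), c_min)
            for _ in range(n_members)]

def predict(ensemble, x):
    """Majority vote over the members: 1 = minority, 0 = majority."""
    votes = sum(math.dist(x, c_min) < math.dist(x, c_maj)
                for c_maj, c_min in ensemble)
    return int(votes * 2 > len(ensemble))
```

A call such as `predict(fit_ensemble(X_maj, X_min), x)` takes lists of equal-length numeric tuples for the two classes and returns 0 or 1 for a query point. Any real use would swap in the paper's own density, entropy, and weighting definitions and a stronger base classifier.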

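Abstract (translated from the Chinese): Undersampling is one of the mainstream approaches to the class imbalance problem. Existing studies show that handling class overlap efficiently can effectively improve the performance of oversampling methods. However, most current research on undersampling holds that the loss of key samples caused by an improper sample selection strategy is the main factor affecting the performance of undersampling methods. Researchers have therefore proposed a series of targeted methods from different perspectives, but class overlap in undersampling has rarely been studied. An undersampling method that fuses Bayes posterior probability and distribution density (BPDDUS) is proposed to detect and clean samples in overlapping regions, and the cleaned samples are then undersampled according to their distribution information. Specifically, the method uses the Bayes posterior probability to clean potential noise and overlapping samples in the majority class, which sharpens the classification decision boundary. For the cleaned majority samples, global distribution density and information entropy are introduced to measure how important each sample is to classification learning on imbalanced data, and sampling weights are assigned accordingly. The samples are undersampled according to these weights and an ensemble classification system is built to improve the generalization ability of the model. Numerical experiments on 43 datasets from the KEEL repository verify the effectiveness of the proposed BPDDUS method.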

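Key words: imbalanced data, undersampling, Bayes posterior probability, global distribution density, ensemble classification, information entropy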