Computer Engineering and Applications, 2019, Vol. 55, Issue (17): 68-75. DOI: 10.3778/j.issn.1002-8331.1804-0307

Imbalanced Data Processing Algorithm Based on Mixed Sampling

ZHANG Ming, HU Xiaohui, WU Jiaxin   

  1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
  • Online: 2019-09-01 Published: 2019-08-30

Research on an Imbalanced Dataset Algorithm Based on Mixed Sampling

张明,胡晓辉,吴嘉昕   

  1. 兰州交通大学 电子与信息工程学院,兰州 730070

Abstract: To address poor classification performance on imbalanced datasets, a novel classification algorithm based on mixed sampling (BSI) is proposed. The method first introduces the coefficient of variation to identify samples in the sparse domain and the dense domain, and then treats the two domains differently. For minority samples in the sparse domain, an oversampling method (BSMOTE) that improves on the SMOTE algorithm is proposed; for majority samples in the dense domain, an improved undersampling method (IS) is proposed. Experiments on six imbalanced datasets show that, compared with traditional algorithms, the proposed algorithm achieves higher G-mean, F-value, and AUC scores and effectively improves overall classification performance on imbalanced datasets.
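
The two building blocks named in the abstract, the coefficient of variation and SMOTE-style interpolation, can be illustrated with a minimal Python sketch. This is not the paper's BSI, BSMOTE, or IS procedure, whose details the abstract does not give. It assumes that the coefficient of variation is taken over a sample's distances to its k nearest neighbours (an assumption made only for illustration) and uses plain SMOTE interpolation. The names `neighbor_distance_cv` and `oversample_minority`, the value of `k`, and the toy data are all hypothetical.

```python
import math
import random
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    mean = statistics.fmean(values)
    if mean == 0:
        return 0.0
    return statistics.pstdev(values) / mean


def k_nearest(sample, others, k):
    """Return the k points in `others` closest to `sample` (Euclidean distance)."""
    return sorted(others, key=lambda p: math.dist(sample, p))[:k]


def neighbor_distance_cv(sample, others, k):
    """Assumed sparsity cue: CV of the distances to the k nearest neighbours."""
    distances = [math.dist(sample, p) for p in k_nearest(sample, others, k)]
    return coefficient_of_variation(distances)


def smote_interpolate(sample, neighbor, rng):
    """Plain SMOTE step: a random point on the segment from sample to neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return tuple(s + gap * (n - s) for s, n in zip(sample, neighbor))


def oversample_minority(minority, k, n_new, rng):
    """Generate n_new synthetic minority points with plain SMOTE (not BSMOTE)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbor = rng.choice(k_nearest(base, others, k))
        synthetic.append(smote_interpolate(base, neighbor, rng))
    return synthetic


# Toy minority class: three clustered points and one isolated point.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (3.0, 3.0)]
rng = random.Random(0)

cvs = [
    neighbor_distance_cv(p, [q for q in minority if q is not p], k=2)
    for p in minority
]
new_points = oversample_minority(minority, k=2, n_new=5, rng=rng)
```

Each synthetic point lies on a line segment between two existing minority samples, so it stays inside the region the minority class already spans. How the paper converts coefficient-of-variation values into sparse and dense domains is not stated in the abstract, so no threshold is applied here.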

Key words: imbalanced datasets, coefficient of variation, SMOTE algorithm, undersampling

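Abstract: To address the unsatisfactory classification performance on imbalanced datasets, a new imbalanced dataset algorithm based on mixed sampling (BSI) is proposed. A "coefficient of variation" is introduced to identify the sparse and dense domains of the samples. For minority-class samples in the sparse domain, an oversampling method that improves on the SMOTE algorithm (BSMOTE) is proposed; for majority-class samples in the dense domain, an improved undersampling method (IS) is proposed. Experiments on six imbalanced datasets show that, compared with traditional algorithms, the algorithm achieves higher G-mean, F-value, and AUC values and effectively improves the overall classification performance on imbalanced datasets.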
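
The three evaluation measures named above (G-mean, F-value, AUC) can be computed from a binary confusion matrix and classifier scores. The sketch below uses their common textbook definitions: G-mean as the geometric mean of sensitivity and specificity, F-value as the F-beta score with `beta=1` assumed by default, and AUC as the fraction of positive-negative pairs ranked correctly. The exact variants used in the paper are not given in the abstract, and the labels and scores here are made-up toy values, not results from the paper.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn), treating `positive` as the minority class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fn, fp, tn


def g_mean(tp, fn, fp, tn):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def f_value(tp, fn, fp, beta=1.0):
    """F-beta score of the positive class (beta=1 is an assumption)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


def auc(y_true, scores, positive=1):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 2 minority (label 1) and 4 majority (label 0) samples.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.1, 0.2, 0.3, 0.8]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
gm = g_mean(tp, fn, fp, tn)
fv = f_value(tp, fn, fp)
area = auc(y_true, scores)
```

Unlike plain accuracy, G-mean falls to zero when either class is entirely misclassified, which is why it is a common choice for imbalanced data.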

Key words: imbalanced datasets, coefficient of variation, SMOTE algorithm, undersampling