Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (18): 111-118. DOI: 10.3778/j.issn.1002-8331.1906-0441


Hybrid Algorithm of DBSCAN and Improved SMOTE for Oversampling

WANG Liang, YE Jimin   

  1. School of Mathematics and Statistics, Xidian University, Xi’an 710126, China
  • Online: 2020-09-15 Published: 2020-09-10

整合DBSCAN和改进SMOTE的过采样算法

王亮,冶继民   

  1. 西安电子科技大学 数学与统计学院,西安 710126

Abstract:

Conventional oversampling algorithms such as SMOTE (Synthetic Minority Over-sampling Technique) suffer from several problems: they ignore within-class imbalance, extend the classification region of the minority class, and synthesize highly similar samples. To account for both within-class imbalance and the diversity of synthetic samples, an oversampling algorithm that combines DBSCAN with an improved SMOTE, DB-MCSMOTE (DBSCAN and Midpoint Centroid Synthetic Minority Over-sampling Technique), is proposed. The algorithm first clusters the minority class samples with DBSCAN. The cluster density and sampling weight of each cluster are then calculated according to the proposed cluster density distribution function. Within each cluster, the improved SMOTE algorithm (MCSMOTE) oversamples along the lines connecting minority class samples that lie far apart, which improves the diversity of the synthetic samples and yields a new dataset that is balanced both between and within classes. Experiments on one two-dimensional synthetic data set and nine UCI data sets show that DB-MCSMOTE effectively improves classification performance on both the minority class samples and the overall data set.
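
The three stages described above (cluster the minority class, weight each cluster by density, interpolate between distant samples inside a cluster) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' DB-MCSMOTE: the abstract does not give the cluster density distribution function or the MCSMOTE synthesis rule, so the density formula (cluster size divided by mean distance to the centroid), the inverse-density weighting, the "farther half" partner selection, and all function and parameter names here are assumptions made for illustration.

```python
import math
import random


def dbscan(points, eps, min_pts):
    """Plain DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nb)
    return labels


def cluster_weights(clusters):
    """ASSUMED rule: density = size / mean distance to centroid,
    sampling weight proportional to 1 / density (sparser clusters get more)."""
    inverse = []
    for pts in clusters:
        centroid = [sum(c) / len(pts) for c in zip(*pts)]
        spread = sum(math.dist(p, centroid) for p in pts) / len(pts)
        density = len(pts) / max(spread, 1e-12)
        inverse.append(1.0 / density)
    total = sum(inverse)
    return [w / total for w in inverse]


def oversample(minority, n_new, eps, min_pts, rng=None):
    """Generate n_new synthetic minority samples (noise points are ignored)."""
    rng = rng or random.Random(0)
    labels = dbscan(minority, eps, min_pts)
    clusters = [[p for p, lab in zip(minority, labels) if lab == k]
                for k in range(max(labels) + 1)]
    if not clusters:
        raise ValueError("DBSCAN found no clusters; adjust eps or min_pts")
    weights = cluster_weights(clusters)
    synthetic = []
    for _ in range(n_new):
        pts = rng.choices(clusters, weights=weights)[0]
        a = rng.choice(pts)
        ranked = sorted(pts, key=lambda q: math.dist(a, q))
        # ASSUMED stand-in for MCSMOTE: pick a partner from the farther half
        b = rng.choice(ranked[len(ranked) // 2:])
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Toy minority class: one tight cluster and one sparse cluster.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
            (5.0, 5.0), (5.8, 5.0), (5.0, 5.8)]
labels = dbscan(minority, eps=1.0, min_pts=3)
new_points = oversample(minority, n_new=10, eps=1.0, min_pts=3)
```

Because each synthetic point lies on a segment between two members of the same cluster, no sample is created in the gap between clusters, which is one way to avoid extending the minority class region. A real implementation would more likely use a library DBSCAN and the density function defined in the paper itself.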

Key words: oversampling, within-class imbalance, minority class, diversity, Synthetic Minority Over-sampling Technique (SMOTE) algorithm, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm

摘要:

针对SMOTE(Synthetic Minority Over-sampling Technique)等传统过采样算法存在的忽略类内不平衡、扩展少数类的分类区域以及合成的新样本高度相似等问题,基于综合考虑类内不平衡和合成样本多样性的思想,提出了一种整合DBSCAN和改进SMOTE的过采样算法DB-MCSMOTE(DBSCAN and Midpoint Centroid Synthetic Minority Over-sampling Technique)。该算法对少数类样本进行DBSCAN聚类,根据提出的簇密度分布函数,计算各个簇的簇密度和采样权重,在各个簇中利用改进的SMOTE算法(MCSMOTE)在相距较远的少数类样本点之间的连线上进行过采样,提高合成样本的多样性,得到新的类间和类内综合平衡数据集。通过对一个二维合成数据集和九个UCI数据集的实验表明,DB-MCSMOTE可以有效提高分类器对少数类样本和整体数据集的分类性能。

关键词: 过采样, 类内不平衡, 少数类, 多样性, SMOTE算法, DBSCAN算法