Computer Engineering and Applications, 2022, Vol. 58, Issue (21): 67-74. DOI: 10.3778/j.issn.1002-8331.2103-0564

• Theory, Research and Development •

Oversampling Algorithm Based on Density Peak Clustering and Radial Basis Function

LU Miaofang, YANG Youlong   

  1. School of Mathematics and Statistics, Xidian University, Xi’an 710126, China
  • Online: 2022-11-01 Published: 2022-11-01

Abstract: Most existing oversampling algorithms consider only the distribution of the minority instances during sampling and ignore the distribution of the majority instances. Moreover, besides imbalance between classes, data sets also have imbalance within classes. To address these problems, this paper proposes a new oversampling method based on density peak clustering and radial basis functions. First, an improved density peak clustering algorithm adaptively clusters the minority instances into several minority sub-clusters. Second, the local density computed during clustering is used to assign a weight to each sub-cluster, and these weights determine the number of instances to be synthesized for each sub-cluster. Finally, a radial basis function is used to compute the mutual minority class potential of each minority instance, and the minority class is oversampled according to this potential. The proposed algorithm is combined with different classifiers and evaluated with different performance indicators; the experiments show that it achieves better classification performance.

Key words: imbalanced data, oversampling, density peak clustering, radial basis function
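
The three stages summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formulas, so every function name and parameter below is hypothetical, and the sketch makes several assumptions. It uses plain density peak clustering with a fixed number of sub-clusters (`n_clusters`) instead of the improved adaptive variant, gives sparser sub-clusters larger weights, takes the potential of a point to be a sum of Gaussian radial basis functions over the majority instances minus the same sum over the minority instances, and generates candidates by interpolation within a sub-cluster, keeping the candidate with the lowest potential.

```python
import numpy as np

def pairwise_dist(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def local_density(D, dc):
    # Gaussian-kernel local density; subtract 1 to drop the self term exp(0).
    return np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0

def density_peak_labels(D, rho, n_clusters):
    # Plain density peak clustering with a fixed cluster count (assumption:
    # the paper's improved, adaptive variant is not reproduced here).
    n = len(rho)
    order = np.argsort(-rho)              # indices by descending density
    delta = np.zeros(n)                   # distance to nearest denser point
    parent = np.full(n, -1)
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]
        j = higher[np.argmin(D[i, higher])]
        delta[i], parent[i] = D[i, j], j
    score = rho * delta
    score[order[0]] = np.inf              # the densest point is always a center
    centers = np.argsort(-score)[:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    for i in order:                       # parents are labelled before children
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels

def subcluster_quota(rho, labels, n_new):
    # Assumption: sparser sub-clusters receive larger weights.
    ids = np.unique(labels)
    mean_rho = np.array([rho[labels == c].mean() for c in ids])
    w = 1.0 / (1.0 + mean_rho)
    w /= w.sum()
    quota = np.floor(w * n_new).astype(int)
    # Hand the rounding remainder to the sub-clusters with the largest weights.
    quota[np.argsort(-w)[: n_new - quota.sum()]] += 1
    return dict(zip(ids.tolist(), quota.tolist()))

def mutual_class_potential(x, X_maj, X_min, gamma):
    # Assumed form: majority RBF sum minus minority RBF sum at point x.
    def rbf_sum(P):
        return np.exp(-(np.linalg.norm(P - x, axis=1) / gamma) ** 2).sum()
    return rbf_sum(X_maj) - rbf_sum(X_min)

def oversample(X_min, X_maj, n_new, n_clusters=2, dc=1.0, gamma=1.0,
               n_candidates=5, seed=0):
    rng = np.random.default_rng(seed)
    D = pairwise_dist(X_min)
    rho = local_density(D, dc)
    labels = density_peak_labels(D, rho, n_clusters)
    synthetic = []
    for c, quota in subcluster_quota(rho, labels, n_new).items():
        members = X_min[labels == c]
        for _ in range(quota):
            # Candidates interpolate between two members of the same sub-cluster.
            a = members[rng.integers(len(members), size=n_candidates)]
            b = members[rng.integers(len(members), size=n_candidates)]
            cands = a + rng.random((n_candidates, 1)) * (b - a)
            scores = [mutual_class_potential(x, X_maj, X_min, gamma)
                      for x in cands]
            # Keep the candidate lying in the most minority-dominated region.
            synthetic.append(cands[int(np.argmin(scores))])
    return np.array(synthetic), labels
```

Calling `oversample(X_min, X_maj, n_new=10)` with two NumPy arrays of shape `(n, d)` returns the synthetic minority instances together with the sub-cluster label of each original minority instance. The kernel widths `dc` and `gamma` are data dependent and would need tuning in practice.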
