Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (11): 120-125.

• Database, Data Mining, Machine Learning •

Spectral clustering based oversampling: oversampling taking within-class imbalance into consideration

LUO Zichao, JIN Sun, QIU Xuefeng

  1. School of Mechanical Engineering, Shanghai Jiaotong University, Shanghai 200000, China
  • Online: 2014-06-01 Published: 2015-04-08

Abstract: Classifying imbalanced datasets is one of the key challenges in data mining, and oversampling is an effective way to address it. However, traditional oversampling methods ignore within-class imbalance, which is pervasive in real-world datasets. To resolve this problem, this paper proposes an oversampling method based on modified spectral clustering. The method first determines the number of clusters automatically and applies spectral clustering to the minority-class samples. It then sets the number of samples to synthesize in each cluster according to the ratio of that cluster's sample count to the total number of minority-class samples. Finally, it oversamples within each cluster to obtain a new dataset that is balanced both between and within classes. The method is validated on 4 real-world datasets. In addition, the results of k-means clustering and modified spectral clustering are compared on a two-dimensional synthetic dataset to explain the performance difference between oversampling algorithms based on the two clustering methods.

Key words: spectral clustering, imbalanced dataset, oversampling
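
The allocation and within-cluster oversampling steps described in the abstract can be illustrated with a minimal sketch. Several points here are assumptions rather than the paper's stated method: the abstract says only that the per-cluster count depends on the ratio of cluster size to total minority size, so the sketch assumes weights inversely proportional to that ratio (smaller clusters receive more synthetic samples); new samples are generated by SMOTE-style linear interpolation between two members of the same cluster; and the cluster labels are taken as given, standing in for the output of the spectral clustering step, which is not shown. The function names `allocate_per_cluster` and `oversample_within_clusters` are hypothetical.

```python
import random
from collections import defaultdict
from fractions import Fraction


def allocate_per_cluster(cluster_sizes, n_new):
    """Split n_new synthetic samples across non-empty clusters.

    ASSUMPTION: each cluster's weight is the inverse of its share of the
    minority class (total / size), so small clusters get more samples.
    The abstract does not state the exact rule.
    """
    total = sum(cluster_sizes)
    weights = [Fraction(total, size) for size in cluster_sizes]
    weight_sum = sum(weights)
    # Exact fractional targets; Fraction avoids floating-point rounding.
    raw = [n_new * w / weight_sum for w in weights]
    counts = [r.numerator // r.denominator for r in raw]
    # Give leftover samples to the clusters with the largest remainders.
    leftovers = n_new - sum(counts)
    by_remainder = sorted(range(len(raw)),
                          key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_remainder[:leftovers]:
        counts[i] += 1
    return counts


def oversample_within_clusters(minority, labels, n_new, seed=0):
    """Generate n_new synthetic minority points, each inside one cluster.

    ASSUMPTION: SMOTE-style interpolation between two random members of
    the same cluster; a one-member cluster just repeats that member.
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for point, label in zip(minority, labels):
        clusters[label].append(point)
    keys = sorted(clusters)
    counts = allocate_per_cluster([len(clusters[k]) for k in keys], n_new)
    synthetic = []
    for key, count in zip(keys, counts):
        members = clusters[key]
        for _ in range(count):
            if len(members) == 1:
                a = b = members[0]
            else:
                a, b = rng.sample(members, 2)
            t = rng.random()  # interpolation factor in [0, 1)
            synthetic.append(tuple(ai + t * (bi - ai)
                                   for ai, bi in zip(a, b)))
    return synthetic


if __name__ == "__main__":
    # Two minority clusters of 30 and 10 points, 40 points to synthesize.
    print(allocate_per_cluster([30, 10], 40))
```

Under the inverse-share assumption, the smaller cluster receives the larger part of the synthetic samples, which is one way to reduce within-class imbalance. A proportional rule would instead preserve the original cluster ratio, so the choice of weighting should be checked against the full paper.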