Spectral clustering based oversampling: oversampling taking within class imbalance into consideration

Abstract

Imbalanced datasets are one of the most crucial challenges encountered by data mining techniques. Oversampling has proven to be an effective way of dealing with them. However, traditional oversampling methods ignore within-class imbalance, which is pervasive in real-world datasets. To address this problem, this paper proposes an oversampling method based on modified spectral clustering. The method first determines the number of clusters automatically, then applies modified spectral clustering to the minority-class samples. Based on the number of samples in each cluster, it decides how many synthetic samples to generate inside each cluster, so that the resulting dataset is balanced both between and within classes. The method is tested on four real-world datasets and one simulated dataset, and the experiments show it to be effective. A comparison between traditional k-means clustering based oversampling and the proposed method is also conducted, and the results are analyzed and explained.

Keywords: spectral clustering, imbalanced dataset, oversampling
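The per-cluster allocation step described in the abstract can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's algorithm. The abstract does not give the exact allocation formula, so the rule used here (grow every minority cluster to the same target size, `n_majority // k`) is a guess at one way to reach both between-class and within-class balance. The cluster labels are assumed to come from a separate clustering step (the paper uses a modified spectral clustering, which is not implemented here), and the synthetic points are produced by SMOTE-style linear interpolation between two members of the same cluster, which is also an assumption. The function names `allocate_per_cluster` and `oversample_within_clusters` are hypothetical.

```python
import random
from collections import defaultdict


def allocate_per_cluster(cluster_sizes, n_majority):
    """Return how many synthetic samples to create in each minority cluster.

    Assumed rule (not taken from the paper): every cluster is grown to the
    same target size, so small clusters receive more synthetic samples.
    """
    k = len(cluster_sizes)
    target = n_majority // k  # equal share of the majority-class count
    return [max(0, target - size) for size in cluster_sizes]


def oversample_within_clusters(minority, labels, n_majority, seed=0):
    """Generate synthetic minority points cluster by cluster.

    minority: list of equal-length numeric tuples (minority-class samples)
    labels:   cluster label of each sample, from any clustering step
    """
    rng = random.Random(seed)

    # Group the minority samples by their cluster label.
    clusters = defaultdict(list)
    for point, label in zip(minority, labels):
        clusters[label].append(point)

    ordered = sorted(clusters)
    quotas = allocate_per_cluster([len(clusters[c]) for c in ordered], n_majority)

    synthetic = []
    for c, quota in zip(ordered, quotas):
        members = clusters[c]
        for _ in range(quota):
            # Interpolate between two members of the SAME cluster, so the
            # new point never bridges two separate minority sub-groups.
            a = rng.choice(members)
            b = rng.choice(members)
            t = rng.random()
            synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# A large cluster of 8 samples and a small one of 2, against 20 majority samples.
print(allocate_per_cluster([8, 2], n_majority=20))  # [2, 8]
```

With these inputs the small cluster receives eight synthetic samples and the large one only two, so both end at ten samples and the minority class as a whole matches the majority count of twenty.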

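Chinese abstract: Classifying imbalanced data is one of the key challenges in data mining. Oversampling is an effective way to address imbalanced classification. Traditional oversampling methods do not consider within-class imbalance, so an oversampling method based on improved spectral clustering is proposed. The method first determines the number of clusters automatically and applies spectral clustering to the minority-class samples. It then determines the number of samples to synthesize in each cluster from the ratio of that cluster's sample count to the total number of minority-class samples, and finally oversamples within each cluster to obtain a balanced new dataset. The effectiveness of the algorithm is verified on four real-world datasets. In addition, the results of k-means clustering and improved spectral clustering are compared on a two-dimensional synthetic dataset to explain why the oversampling algorithms built on the two clustering methods perform differently.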


LUO Zichao, JIN Sun, QIU Xuefeng. Spectral clustering based oversampling: oversampling taking within class imbalance into consideration[J]. Computer Engineering and Applications, 2014, 50(11): 120-125.

骆自超，金隼，邱雪峰. 考虑类内不平衡的谱聚类过抽样方法[J]. 计算机工程与应用, 2014, 50(11): 120-125.

