Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (22): 42-52. DOI: 10.3778/j.issn.1002-8331.2107-0097

• Research Hotspots and Reviews •

Review of Classification Methods for Unbalanced Data Sets

WANG Le, HAN Meng, LI Xiaojuan, ZHANG Ni, CHENG Haodong   

  1. School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China
  • Online: 2021-11-15 Published: 2021-11-16

Abstract:

The characteristics of unbalanced data sets create many difficulties for classification. This paper analyzes and summarizes classification methods for unbalanced data sets. First, data sampling methods are introduced in detail from three perspectives: under-sampling, over-sampling, and mixed sampling. Under-sampling methods are divided into three groups, based on K-Nearest Neighbor (KNN), Bagging, and Boosting. Over-sampling methods are analyzed from the perspectives of the Synthetic Minority Over-sampling Technique (SMOTE) and the Support Vector Machine (SVM). The advantages and disadvantages of these two kinds of sampling methods are compared, and the performance of the algorithms on the same data sets is analyzed and summarized. Then, classification methods for unbalanced data sets are summarized from four further aspects: deep learning, extreme learning machine, cost sensitivity, and feature selection. Finally, directions for future work are discussed.

Key words: unbalanced data set, classification, sampling method, K-Nearest Neighbor (KNN), Synthetic Minority Over-sampling Technique (SMOTE), deep learning
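
For readers unfamiliar with the two basic sampling families named in the abstract, the following is a minimal illustrative sketch, not taken from the paper. It assumes two-dimensional numeric samples, random selection for under-sampling, and a simplified SMOTE-style interpolation between a minority sample and one of its nearest minority neighbours. The function names `random_undersample` and `smote_like_oversample`, the parameter `k`, and the toy data are all hypothetical choices made for illustration.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Randomly keep only as many majority samples as there are minority samples."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority sample
    and one of its k nearest minority neighbours (simplified SMOTE-style step)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x inside the minority class (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Toy unbalanced data set (assumed): 20 majority points, 4 minority points.
majority = [(float(i), float(i % 3)) for i in range(20)]
minority = [(0.0, 10.0), (1.0, 11.0), (2.0, 10.5), (1.5, 12.0)]

# Under-sampling: shrink the majority class to the minority size.
under_major, under_minor = random_undersample(majority, minority)

# Over-sampling: grow the minority class to the majority size.
balanced_minor = minority + smote_like_oversample(
    minority, len(majority) - len(minority)
)
```

Under-sampling discards majority information, whereas the interpolation step adds synthetic minority points that lie between existing ones; mixed sampling combines the two. A production implementation would normally rely on an established library rather than this sketch.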
