Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (24): 12-27.DOI: 10.3778/j.issn.1002-8331.2007-0120

Machine Learning Classification Strategy for Imbalanced Data Sets

XU Lingling, CHI Dongxiang   

  1. School of Electronic Information Engineering, Shanghai Dianji University, Shanghai 201306, China
  • Online: 2020-12-15 Published: 2020-12-15

Abstract:

Because of the inherent characteristics of imbalanced data sets, classification results are often dominated by the majority class, which degrades classification performance. In recent years, a series of strategies for improving the accuracy of machine learning classification algorithms on imbalanced data sets has been proposed, with the aim of learning the underlying patterns of the data and extracting its potential value despite the class imbalance. These strategies address the difficulty of classifying imbalanced data sets mainly at two levels: the data level and the classification-model level. This paper discusses machine learning classification strategies for imbalanced data sets from these two perspectives, analyzes the evaluation metrics used for classifiers trained on imbalanced data sets, summarizes the open problems in imbalanced classification, and outlines directions for future research. In particular, the studies discussed focus mainly on the difficulties of binary classification under extreme class imbalance.

Key words: imbalanced data set, resampling strategy, classification model, evaluation index
