Machine Learning Classification Strategy for Imbalanced Data Sets

doi:10.3778/j.issn.1002-8331.2007-0120

Computer Engineering and Applications, 2020, Vol. 56, Issue 24: 12-27. DOI: 10.3778/j.issn.1002-8331.2007-0120

XU Lingling, CHI Dongxiang

School of Electronic Information Engineering, Shanghai Dianji University, Shanghai 201306, China

Online: 2020-12-15 Published: 2020-12-15

Abstract

Because of the inherent characteristics of imbalanced data sets, classification results are often dominated by the majority class, which degrades classification performance. In recent years, a series of strategies have been proposed to improve the accuracy of machine learning classification algorithms on imbalanced data sets, so that the underlying patterns in class-imbalanced data can be learned and their potential value exploited. These strategies address the difficulty of classifying imbalanced data mainly at two levels: the data level and the classification-model level. This paper reviews machine learning classification strategies for imbalanced data sets from these two perspectives, analyzes and discusses the evaluation metrics used for classifiers on imbalanced data sets, summarizes the open problems in imbalanced classification, and outlines directions for future research. In particular, the studies discussed focus mainly on the difficulties of binary classification under extreme class imbalance.

Key words: imbalanced data set, resampling strategy, classification model, evaluation index
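
The abstract's central point, that results on imbalanced data are dominated by the majority class and therefore need suitable evaluation metrics and data-level remedies, can be illustrated with a small sketch. The code below is not taken from the paper: the data are synthetic, the function names (`confusion_counts`, `random_oversample`, and so on) are hypothetical, and random oversampling stands in as one simple example of a data-level resampling strategy.

```python
import random


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def recall(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * rec / (precision + rec) if (precision + rec) else 0.0


def random_oversample(samples, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority samples until both classes match in size.

    Assumes binary labels and at least one minority sample.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    if not minority_idx:
        raise ValueError("no minority samples to oversample")
    majority_count = len(labels) - len(minority_idx)
    extra = [rng.choice(minority_idx)
             for _ in range(majority_count - len(minority_idx))]
    idx = list(range(len(labels))) + extra
    return [samples[i] for i in idx], [labels[i] for i in idx]


# Synthetic, extremely imbalanced binary problem: 10 positives, 990 negatives.
y_true = [1] * 10 + [0] * 990
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)   # high, despite missing every positive
rec = recall(y_true, y_pred)     # minority-class recall
f1 = f1_score(y_true, y_pred)

# Data-level remedy: rebalance the training set before fitting a classifier.
X = list(range(1000))            # placeholder feature values
X_bal, y_bal = random_oversample(X, y_true)
```

In this sketch the always-majority predictor reaches an accuracy of 0.99 while its minority-class recall and F1 are both 0, which is why accuracy alone is a poor evaluation metric under extreme imbalance. After oversampling, both classes contain 990 samples.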

XU Lingling, CHI Dongxiang. Machine Learning Classification Strategy for Imbalanced Data Sets[J]. Computer Engineering and Applications, 2020, 56(24): 12-27.

