Computer Engineering and Applications ›› 2019, Vol. 55 ›› Issue (21): 1-17. DOI: 10.3778/j.issn.1002-8331.1907-0040

• Hot Topics and Reviews •

Research on Sampling Strategies in Class-Imbalanced Learning

LIU Shudong (刘树栋), ZHANG Ke (张可)

  1. Centre for Artificial Intelligence and Applied Research, Zhongnan University of Economics and Law, Wuhan 430073, China
  2. School of Information and Security Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China
  • Published: 2019-11-01 Online: 2019-10-30

Abstract: Class-imbalanced learning is widely applied in domains such as credit scoring, customer churn prediction, medical diagnosis, short-text sentiment analysis, label learning, and rating prediction. It is one of the most active topics in machine learning research and applications, and in recent years it has attracted growing attention from both academia and industry. Existing solutions to the class imbalance problem fall into three groups: data-level solutions, algorithm-level solutions, and ensemble solutions. This paper surveys recent progress on sampling strategies, which are among the most important data-level methods. It introduces the fundamentals of class-imbalanced learning, including the formal definition, performance metrics, and the basic framework; reviews, compares, and analyzes recent work on the three main sampling strategies (over-sampling, under-sampling, and hybrid sampling); and discusses open problems, active research topics, and likely directions for future work.

Key words: class-imbalanced learning, ensemble learning, undersampling, feature selection, support vector machine, synthetic minority oversampling technique, hybrid sampling
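
For readers unfamiliar with the three sampling strategies named in the abstract, the following is a minimal sketch of their simplest random variants. It is not taken from the paper: the function names (`random_oversample`, `random_undersample`, `hybrid_sample`) and the `target` parameter are hypothetical, and the sketch only duplicates or discards existing samples. Methods such as the synthetic minority oversampling technique listed in the key words instead synthesize new minority samples and are not implemented here.

```python
import random
from collections import Counter


def _by_class(X, y):
    """Group samples by class label, preserving first-seen label order."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return groups


def random_oversample(X, y, seed=0):
    """Over-sampling: duplicate randomly chosen samples of each smaller
    class until every class is as large as the largest one."""
    rng = random.Random(seed)
    groups = _by_class(X, y)
    target = max(len(pool) for pool in groups.values())
    X_out, y_out = [], []
    for label, pool in groups.items():
        # Draw the missing samples with replacement from the class itself.
        extra = rng.choices(pool, k=target - len(pool))
        X_out.extend(pool + extra)
        y_out.extend([label] * target)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Under-sampling: keep a random subset of each larger class so that
    every class is as small as the smallest one."""
    rng = random.Random(seed)
    groups = _by_class(X, y)
    target = min(len(pool) for pool in groups.values())
    X_out, y_out = [], []
    for label, pool in groups.items():
        # Draw without replacement, so no sample is kept twice.
        X_out.extend(rng.sample(pool, target))
        y_out.extend([label] * target)
    return X_out, y_out


def hybrid_sample(X, y, target, seed=0):
    """Hybrid sampling: resample every class to exactly `target` samples,
    shrinking classes above the target and growing classes below it."""
    rng = random.Random(seed)
    X_out, y_out = [], []
    for label, pool in _by_class(X, y).items():
        if len(pool) >= target:
            chosen = rng.sample(pool, target)
        else:
            chosen = pool + rng.choices(pool, k=target - len(pool))
        X_out.extend(chosen)
        y_out.extend([label] * target)
    return X_out, y_out


if __name__ == "__main__":
    # Toy data set (an assumption for illustration): 10 majority, 2 minority.
    X = [[i] for i in range(12)]
    y = [0] * 10 + [1] * 2
    for name, (Xr, yr) in {
        "over": random_oversample(X, y),
        "under": random_undersample(X, y),
        "hybrid": hybrid_sample(X, y, target=6),
    }.items():
        print(name, sorted(Counter(yr).items()))
```

On the toy data, over-sampling balances both classes at 10 samples, under-sampling at 2, and hybrid sampling at the chosen target of 6. The trade-off is the usual one: duplication can encourage overfitting to the minority class, while discarding majority samples loses information.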