Computer Engineering and Applications, 2011, Vol. 47, Issue (1): 139-143. DOI: 10.3778/j.issn.1002-8331.2011.01.038

• Database, Signal and Information Processing •

Imbalanced data sets classification method based on over-sampling technique

WANG Chunyu, SU Hongye, QU Yu, CHU Jian

  1. State Key Lab of Industrial Control Technology, Institute of Cyber-Systems & Control, Zhejiang University, Hangzhou 310027, China

  • Received: 2010-08-09 Revised: 2010-10-14 Online: 2011-01-01 Published: 2011-01-01
  • Contact: WANG Chunyu

Abstract: Classifying data with an imbalanced class distribution is an active research topic in machine learning. To address the difficulty of classifying imbalanced data sets, in particular the poor recognition of the minority class caused by the skewed distribution, this paper proposes an improved algorithm, AdaBoost-SVM-OBMS, which combines Boosting, an ensemble learning algorithm, with an over-sampling technique that generates new samples from misclassified ones. The algorithm uses a support vector machine as the base classifier. In each Boosting iteration the misclassified samples are marked, and a number of new samples with the same class label as each misclassified sample are generated at random between that sample and its nearest neighbors, for the majority and minority classes separately. The new samples are added to the original training set and the classifier is retrained, which improves recognition of hard-to-classify samples. AdaBoost-SVM-OBMS is compared with AdaBoost-SVM and APLSC on eight benchmark data sets under three evaluation metrics: AUC, F-value, and G-mean. The experimental results demonstrate the effectiveness of AdaBoost-SVM-OBMS for classifying imbalanced data sets.
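The over-sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `oversample_misclassified`, the parameters `n_new`, `k`, and `seed`, the use of Euclidean distance, and the choice to draw neighbors only from the same class as the misclassified sample (as in SMOTE-style interpolation) are all assumptions made for illustration. The Boosting loop and the SVM base classifier are omitted; the indices of misclassified samples are simply passed in. Only the Python standard library is used so the example stays self-contained.

```python
import math
import random


def oversample_misclassified(X, y, mis_idx, n_new=2, k=3, seed=0):
    """Generate synthetic samples between each misclassified sample and
    one of its k nearest same-class neighbours (hypothetical helper)."""
    rng = random.Random(seed)
    new_X, new_y = [], []
    for i in mis_idx:
        # Assumption: neighbours are taken from the same class as sample i.
        same = [j for j in range(len(X)) if y[j] == y[i] and j != i]
        if not same:
            continue  # no same-class neighbour to interpolate towards
        nearest = sorted(same, key=lambda j: math.dist(X[i], X[j]))[:k]
        for _ in range(n_new):
            nb = X[rng.choice(nearest)]
            gap = rng.random()  # position along the segment, in [0, 1)
            new_X.append([a + gap * (b - a) for a, b in zip(X[i], nb)])
            new_y.append(y[i])  # new sample keeps the misclassified label
    return new_X, new_y


# Toy data: three samples of class +1 and two of class -1.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y = [1, 1, 1, -1, -1]

# Pretend the base classifier misclassified samples 0 and 3.
X_new, y_new = oversample_misclassified(X, y, mis_idx=[0, 3], n_new=2, k=2)

# Augmented training set that would be used to retrain the classifier.
X_aug = X + X_new
y_aug = y + y_new
```

In a full Boosting round, `mis_idx` would come from comparing the base classifier's predictions with the true labels, and `X_aug`, `y_aug` would be used to retrain the classifier in the next iteration.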
