Computer Engineering and Applications ›› 2011, Vol. 47 ›› Issue (28): 169-172.

• Graphics, Images, and Pattern Recognition •

New algorithm of AdaBoost for unbalanced datasets

WANG Canwei (1, 2, 4), YU Zhilou (3), ZHANG Huaxiang (1)

  1. Department of Information Science and Engineering, Shandong Normal University, Jinan 250014, China
  2. Department of Information and Engineering, Shandong Trade Union Cadre Institute, Jinan 250100, China
  3. Inspur Group, Jinan 250101, China
  4. Shandong Province Distributed Computer Software New Technique Key Laboratory, Jinan 250014, China
  • Received:1900-01-01 Revised:1900-01-01 Online:2011-10-01 Published:2011-10-01

Abstract: This paper proposes ILAdaboost, a new AdaBoost training method suited to unbalanced datasets. In each iteration, the algorithm evaluates the original dataset with the base classifier just learned and, according to the result, divides the original dataset into four subsets. It then re-samples from the four subsets to form a balanced dataset on which the base classifier of the next iteration is trained. Because re-sampling favors the minority class and the misclassified majority-class examples, the decision surface of the combined classifier shifts away from the minority class. In experiments on 10 typical unbalanced datasets from UCI, the algorithm increases the minority-class accuracy and the GMA while maintaining the majority-class accuracy.
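
The re-sampling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the four subsets are defined or how many examples are drawn from each, so the sketch assumes the split is by class (minority or majority) and by whether the current base classifier predicts the example correctly. The names split_four, balanced_resample, geometric_mean_accuracy and the parameter wrong_majority_share are hypothetical, and GMA is assumed here to denote the geometric mean of the two per-class accuracies.

```python
import math
import random


def split_four(labels, predictions, minority_label):
    """Partition example indices into four subsets by class and correctness.

    Assumption: the abstract only says "four subsets"; class x correctness
    is one plausible reading, not a confirmed detail of ILAdaboost.
    """
    subsets = {"min_correct": [], "min_wrong": [], "maj_correct": [], "maj_wrong": []}
    for i, (y, p) in enumerate(zip(labels, predictions)):
        cls = "min" if y == minority_label else "maj"
        outcome = "correct" if y == p else "wrong"
        subsets[f"{cls}_{outcome}"].append(i)
    return subsets


def balanced_resample(subsets, n_per_class, wrong_majority_share=0.5, rng=None):
    """Draw n_per_class indices from each class, with replacement.

    wrong_majority_share is a hypothetical knob: the fraction of majority
    draws taken from the misclassified majority subset, which is how this
    sketch favors the misclassified majority examples.
    """
    rng = rng or random.Random(0)
    minority_pool = subsets["min_correct"] + subsets["min_wrong"]
    if not minority_pool or not (subsets["maj_correct"] or subsets["maj_wrong"]):
        raise ValueError("both classes must be present")

    # Minority examples are drawn from the whole minority class.
    minority = [rng.choice(minority_pool) for _ in range(n_per_class)]

    # Decide how many majority draws come from the misclassified subset.
    if subsets["maj_wrong"] and subsets["maj_correct"]:
        n_wrong = round(n_per_class * wrong_majority_share)
    elif subsets["maj_wrong"]:
        n_wrong = n_per_class
    else:
        n_wrong = 0
    majority = [rng.choice(subsets["maj_wrong"]) for _ in range(n_wrong)]
    majority += [rng.choice(subsets["maj_correct"]) for _ in range(n_per_class - n_wrong)]
    return minority + majority


def geometric_mean_accuracy(labels, predictions, minority_label):
    """Geometric mean of minority-class and majority-class accuracy.

    Assumes both classes occur in labels; otherwise a division by zero occurs.
    """
    s = split_four(labels, predictions, minority_label)
    n_min = len(s["min_correct"]) + len(s["min_wrong"])
    n_maj = len(s["maj_correct"]) + len(s["maj_wrong"])
    return math.sqrt((len(s["min_correct"]) / n_min) * (len(s["maj_correct"]) / n_maj))


# Toy data: label 1 is the minority class.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
predictions = [1, 0, 0, 0, 0, 1, 1, 0]
subsets = split_four(labels, predictions, minority_label=1)
sample = balanced_resample(subsets, n_per_class=4)
gma = geometric_mean_accuracy(labels, predictions, minority_label=1)
```

In a boosting loop, the indices in sample would select the training examples for the next base classifier. Drawing a fixed share of the majority examples from the misclassified subset is only one way to realize the bias toward misclassified majority examples that the abstract mentions.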

Key words: unbalanced dataset, ensemble learning, AdaBoost, re-sample
