Computer Engineering and Applications, 2018, Vol. 54, Issue (21): 128-132. DOI: 10.3778/j.issn.1002-8331.1707-0314

• Pattern Recognition and Artificial Intelligence •

Optimization boosting classification based on metrics of imbalanced data

YAN Jianhong (闫建红)

  1. Department of Computer Science, Taiyuan Normal University, Taiyuan 030012, China
  • Online: 2018-11-01 Published: 2018-10-30

Abstract: To improve classification performance on imbalanced data, a Boosting algorithm optimized by imbalanced-data performance metrics is proposed. The algorithm combines these metrics with Boosting by replacing the usual misclassification rate with one of three measures: the weighted recall of the positive and negative classes, the F-measure, or the G-mean. In each iteration the Alpha value is computed from the chosen metric, which yields a weighted combination of weak learners. In experiments comparing the method with a weighted Boosting algorithm, it raises the AUC and lowers the error rate on some data sets, and it improves the F-measure and G-mean to some extent. These results indicate that the algorithm focuses on improving positive-class performance and improves overall classification performance on imbalanced data.

Key words: imbalanced data sets, binary classification, area under the ROC curve (AUC), metrics optimization, Boosting algorithm
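
The abstract does not give the formula used to turn a metric into an Alpha value, so the following is only a minimal sketch of the general idea, not the author's method. It assumes labels in {+1, -1} with +1 as the positive (minority) class, and it assumes Alpha = 0.5 · ln(m / (1 − m)) for a weighted metric m, by analogy with AdaBoost's 0.5 · ln((1 − err) / err). The helper names (`metrics`, `alpha_from_metric`, `stump`, `boost`, `predict`), the weight factor `beta`, and the stopping rule m ≤ 0.5 are all hypothetical.

```python
import math


def metrics(y_true, y_pred, w, beta=0.5):
    """Sample-weighted metrics for imbalanced binary data (+1 = positive class)."""
    tp = fn = fp = tn = 0.0
    for t, p, wi in zip(y_true, y_pred, w):
        if t == 1 and p == 1:
            tp += wi
        elif t == 1:
            fn += wi
        elif p == 1:
            fp += wi
        else:
            tn += wi
    recall_pos = tp / (tp + fn) if tp + fn else 0.0
    recall_neg = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall_pos
    return {
        # beta is an assumed weight factor between the two recalls
        "weighted_recall": beta * recall_pos + (1 - beta) * recall_neg,
        "f_measure": 2 * precision * recall_pos / denom if denom else 0.0,
        "g_mean": math.sqrt(recall_pos * recall_neg),
    }


def alpha_from_metric(m, eps=1e-10):
    """Assumed mapping from a metric in [0, 1] to a learner weight."""
    m = min(max(m, eps), 1 - eps)  # clamp to avoid log(0) and division by zero
    return 0.5 * math.log(m / (1 - m))


def stump(threshold, sign=1):
    """One-dimensional threshold classifier used as a weak learner."""
    return lambda x: sign if x > threshold else -sign


def boost(xs, ys, candidates, rounds=5, metric="g_mean"):
    """Boosting loop that scores weak learners by `metric` instead of error rate."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        # Choose the candidate with the best weighted metric under current weights.
        best = max(candidates,
                   key=lambda h: metrics(ys, [h(x) for x in xs], w)[metric])
        preds = [best(x) for x in xs]
        m = metrics(ys, preds, w)[metric]
        if m <= 0.5:  # assumed stopping rule: Alpha would not be positive
            break
        a = alpha_from_metric(m)
        ensemble.append((a, best))
        # AdaBoost-style reweighting: misclassified samples gain weight.
        w = [wi * math.exp(-a * y * p) for wi, y, p in zip(w, ys, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Sign of the Alpha-weighted vote of the weak learners."""
    score = sum(a * h(x) for a, h in ensemble)
    return 1 if score > 0 else -1
```

Passing `metric="f_measure"` or `metric="weighted_recall"` to `boost` switches the optimization target, which mirrors the three variants named in the abstract. The reweighting step is kept identical to standard AdaBoost here; whether the paper changes it is not stated in the abstract.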