计算机工程与应用 ›› 2016, Vol. 52 ›› Issue (16): 51-55.

• 理论与研发 •

折中规划分类性能的少数类误分代价优化设计

靳  燕1,彭新光2   

  1. 山西大学 商务学院 信息学院,太原 030031
  2. 太原理工大学 计算机科学与技术学院,太原 030024
  • 出版日期:2016-08-15 发布日期:2016-08-12

Optimization design of minority-class misclassification cost based on classification performance compromise plan

JIN Yan1, PENG Xinguang2   

  1. Information Institute, Business College of Shanxi University, Taiyuan 030031, China
  2. College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan 030024, China
  • Online:2016-08-15 Published:2016-08-12

摘要: 针对代价敏感思想在类不平衡问题中的传统代价给定方式,提出了分类性能需求引导代价优化的因子量化方法。分类性能需求表示为相关于代价因子c的正负类分类性能指标函数式,为代价择优标准。应用遗传算法基于该标准在指定值域内寻优,得到最优代价因子,并将其代入代价敏感Boosting学习方法,产生基于给定分类性能的分类模型。折中分类性能的算法实现以正负类召回率的几何平均作为择优标准,选用了四类算法(基算法C4.5和ZeroR)依次在三组样本集上进行分类建模。与传统代价给定方式代入算法相比,寻优过程确定的代价因子代入AdaCost算法后,基于C4.5和ZeroR的分类器在TP与TN上的变化幅度依次为33.3%~200%、-49%~-15.6%和-44.4%~-16.7%、25%~400%。前者改善了正类误判情形,且未造成负类误判严重化;后者改善了负类严重误判情形,且正类召回率保持在0.5以上,分类性能达到较为均衡的状态。

关键词: 少数类分类, 代价敏感学习, 遗传算法, 代价因子优化, 分类性能均衡

Abstract: In cost-sensitive approaches to class imbalance, the misclassification cost is traditionally set from the numbers of samples in the different classes. This paper proposes a quantization method in which the classification performance requirement guides the optimization of the cost factor. The performance requirement is expressed as a function of the positive-class and negative-class performance indicators, which depend on the cost factor c, and it serves as the selection criterion. A genetic algorithm searches a specified value range for the cost factor that is optimal under this criterion, and the result is substituted into a cost-sensitive boosting method to learn a classifier that reflects the given performance requirement. To obtain a performance compromise, the implementation uses the geometric mean of the recall rates of the positive and negative classes as the selection criterion. Four algorithms, with C4.5 and ZeroR as base learners, are used to learn classifiers on three datasets. Compared with the traditional way of setting the cost, substituting the optimized cost factor into the AdaCost algorithm changes TP by 33.3% to 200% and TN by -49% to -15.6% for the C4.5-based classifier, and changes TP by -44.4% to -16.7% and TN by 25% to 400% for the ZeroR-based classifier. The former reduces misjudgment of the positive class without causing serious misjudgment of the negative class; the latter reduces the serious misjudgment of the negative class while keeping the recall rate of the positive class at or above 0.5, so the classification performance is more balanced and stable.

Key words: minority class classification, cost sensitive learning, genetic algorithm, misclassification cost optimization, classification performance compromise
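
The search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `toy_confusion` is a hypothetical stand-in for training and evaluating a cost-sensitive boosting classifier (the paper uses AdaCost with C4.5 or ZeroR as base learners), and the search range, population size, and the selection, crossover, and mutation operators are all assumptions chosen for illustration. Only the selection criterion, the geometric mean of the positive-class and negative-class recall rates, is taken from the abstract.

```python
import math
import random


def g_mean(tp, fn, tn, fp):
    """Geometric mean of positive-class recall and negative-class recall."""
    recall_pos = tp / (tp + fn) if (tp + fn) else 0.0
    recall_neg = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(recall_pos * recall_neg)


def toy_confusion(c):
    """Hypothetical stand-in for learning a classifier with cost factor c.

    Assumption for illustration only: a larger c recovers more of the 20
    positive samples but misclassifies more of the 200 negative samples.
    Returns (tp, fn, tn, fp).
    """
    tp = round(20 * min(1.0, c / 10.0))
    tn = round(200 * max(0.0, 1.0 - c / 20.0))
    return tp, 20 - tp, tn, 200 - tn


def fitness(c):
    """Selection criterion: G-mean obtained with cost factor c."""
    return g_mean(*toy_confusion(c))


def ga_search(low, high, pop_size=20, generations=30, seed=0):
    """Search [low, high] for the cost factor with the highest fitness."""
    rng = random.Random(seed)
    pop = [rng.uniform(low, high) for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: the better half survives unchanged.
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            w = rng.random()
            child = w * a + (1 - w) * b                  # arithmetic crossover
            child += rng.gauss(0, 0.05 * (high - low))   # Gaussian mutation
            children.append(min(high, max(low, child)))  # clip to the range
        pop = parents + children
    return max(pop, key=fitness)


if __name__ == "__main__":
    best_c = ga_search(1.0, 20.0)
    print("cost factor:", best_c, "G-mean:", fitness(best_c))
```

In a real experiment, `toy_confusion` would be replaced by a routine that trains the cost-sensitive classifier with the candidate cost factor and returns its confusion counts on validation data; the rest of the loop would stay the same.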