Computer Engineering and Applications ›› 2018, Vol. 54 ›› Issue (18): 168-173. DOI: 10.3778/j.issn.1002-8331.1705-0334


Research on classification algorithm of imbalanced datasets based on improved SMOTE

ZHAO Qinghua, ZHANG Yihao, MA Jianfen, DUAN Qianqian   

  1. MicroNano System Research Center, College of Information Engineering & Key Lab of Advanced Transducers and Intelligent Control System (Ministry of Education), Taiyuan University of Technology, Taiyuan 030600, China
  • Online: 2018-09-15 Published: 2018-10-16

改进SMOTE的非平衡数据集分类算法研究

赵清华,张艺豪,马建芬,段倩倩   

  1. 太原理工大学 信息工程学院&新型传感器和智能控制教育部(山西)重点实验室 微纳系统研究中心,太原 030600

Abstract: Combining random forest with the SMOTE algorithm to handle imbalanced datasets has two shortcomings: the resulting samples tend to be distributed toward the margins of the dataset, and the computational complexity is high. This paper proposes two improved algorithms based on SMOTE: TSMOTE (triangle SMOTE) and MDSMOTE (Max Distance SMOTE). The core idea is to restrict the generation of new samples to a certain region so that the distribution of the sample set tends to be centralized. Synthetic samples are constructed from fewer positive-class sample points, which limits the sample region and reduces the complexity of the algorithm. Extensive experiments on six imbalanced datasets show that, compared with the traditional SMOTE algorithm, the improved algorithms reduce time consumption and achieve higher G-mean, F-value, and AUC.

Key words: random forest, SMOTE algorithm, imbalanced dataset
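
To make the idea of region-restricted oversampling concrete, the following minimal Python sketch contrasts the classic SMOTE interpolation step with one possible way of confining a new sample to a triangle of minority points. The abstract does not specify how TSMOTE or MDSMOTE select or sample their regions, so `triangle_sample` and all function and parameter names here are illustrative assumptions, not the authors' method.

```python
import math
import random


def nearest_neighbors(point, samples, k):
    """Return the k samples closest to `point` (Euclidean), excluding `point` itself."""
    others = [s for s in samples if s is not point]
    return sorted(others, key=lambda s: math.dist(point, s))[:k]


def smote_sample(minority, k=2, rng=random):
    """Classic SMOTE step: interpolate between a minority point and one of its k neighbors."""
    base = rng.choice(minority)
    neighbor = rng.choice(nearest_neighbors(base, minority, k))
    gap = rng.random()  # position along the segment from base to neighbor
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]


def triangle_sample(a, b, c, rng=random):
    """Hypothetical region-restricted step: draw a point inside triangle (a, b, c).

    Assumption for illustration only; the abstract does not describe how
    TSMOTE builds or samples its triangles.
    """
    # Two sorted uniforms split [0, 1] into three non-negative weights summing to 1,
    # so the weighted combination always lies inside the triangle.
    u, v = sorted((rng.random(), rng.random()))
    w = (u, v - u, 1.0 - v)
    return [w[0] * x + w[1] * y + w[2] * z for x, y, z in zip(a, b, c)]
```

In this sketch the SMOTE step can place a point anywhere on a segment between two neighbors, while the triangle step confines it to the convex hull of three chosen points, which is one way a generator could keep synthetic samples away from the dataset margin.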

摘要: 针对随机森林和SMOTE组合算法在处理不平衡数据集上存在数据集边缘化分布以及计算复杂度大等问题,提出了基于SMOTE的改进算法TSMOTE(triangle SMOTE)和MDSMOTE(Max Distance SMOTE),其核心思想是将新样本的产生限制在一定区域,使得样本集分布趋于中心化,用更少的正类样本点人为构造样本,从而达到限制样本区域、降低算法复杂度的目的。在6种不平衡数据集上的大量实验表明,改进算法与传统算法相比,算法消耗时间大幅减少,取得更高的G-mean值、F-value值和AUC值。

关键词: 随机森林, SMOTE算法, 不平衡数据集
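
The three evaluation measures named in the abstract (G-mean, F-value, AUC) can be sketched in a few lines of Python. The abstract gives no formulas, so the code below uses the standard textbook definitions as an assumption: G-mean as the geometric mean of sensitivity and specificity, F-value as the F-beta score with `beta=1` by default, and AUC as the fraction of positive/negative score pairs ranked correctly. It also assumes every denominator is nonzero.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


def f_value(tp, fn, fp, beta=1.0):
    """F-beta score on the positive (minority) class; beta=1 gives the usual F1."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def auc(pos_scores, neg_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))
```

For example, with 8 true positives, 2 false negatives, 90 true negatives, and 10 false positives, `g_mean(8, 2, 90, 10)` is the square root of 0.8 × 0.9, which rewards a classifier only when both classes are recognized well.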