Computer Engineering and Applications, 2020, Vol. 56, Issue (11): 39-45. DOI: 10.3778/j.issn.1002-8331.1908-0338

Improved Oversampling and Random Forest Algorithm for Imbalanced Data

ZHANG Jiawei, GUO Linming, YANG Xiaomei   

  1. School of Electrical Engineering, Sichuan University, Chengdu 610000, China
  • Online: 2020-06-01 Published: 2020-06-01

针对不平衡数据的过采样和随机森林改进算法

张家伟,郭林明,杨晓梅   

  1. 四川大学 电气工程学院,成都 610000

Abstract:

To address the low recognition rate of minority-class samples caused by imbalanced data, an improved algorithm based on weighted oversampling and random forest is proposed to reduce the influence of data imbalance on the classifier. In the data preprocessing step, weighted oversampling based on the Synthetic Minority Oversampling Technique (SMOTE) is applied to reduce the imbalance ratio. Each minority-class sample is assigned a weight determined by the Euclidean distance between it and the remaining minority-class samples, so that different samples generate different numbers of new synthetic samples. To improve the random forest, the Kappa coefficient is used to evaluate the classification performance of each trained decision tree, and each tree is given a corresponding weight, so that better-performing trees have more voting power in the final voting stage. Experiments on KEEL datasets show that, compared with the unimproved algorithm, the proposed algorithm improves both the classification accuracy on minority-class samples and the overall classification performance on imbalanced datasets.
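
The weighted oversampling step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the code below assumes that a sample's weight is proportional to its mean Euclidean distance to the other minority-class samples; the paper's actual formula may differ (it could, for example, be inverse). The function names `distance_weights`, `allocate_counts`, and `weighted_smote`, and the parameters `n_new`, `k`, and `seed`, are hypothetical and chosen for illustration only.

```python
import math
import random


def distance_weights(minority):
    """Return one normalized weight per minority sample.

    Assumption: the weight grows with the mean Euclidean distance to the
    other minority samples. The abstract does not state the exact formula.
    """
    mean_dist = []
    for i, x in enumerate(minority):
        others = [math.dist(x, y) for j, y in enumerate(minority) if j != i]
        mean_dist.append(sum(others) / len(others))
    total = sum(mean_dist)
    if total == 0:
        # All samples coincide: fall back to uniform weights.
        return [1.0 / len(minority)] * len(minority)
    return [d / total for d in mean_dist]


def allocate_counts(weights, n_new):
    """Split n_new synthetic samples across minority samples by weight."""
    counts = [int(w * n_new) for w in weights]
    # Give the rounding remainder to the largest fractional parts.
    order = sorted(
        range(len(weights)),
        key=lambda i: weights[i] * n_new - counts[i],
        reverse=True,
    )
    for i in order[: n_new - sum(counts)]:
        counts[i] += 1
    return counts


def weighted_smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    Each minority sample contributes a number of synthetic points set by
    its weight, interpolating toward one of its k nearest minority
    neighbours.
    """
    rng = random.Random(seed)
    counts = allocate_counts(distance_weights(minority), n_new)
    synthetic = []
    for i, (x, count) in enumerate(zip(minority, counts)):
        neighbours = sorted(
            (y for j, y in enumerate(minority) if j != i),
            key=lambda y: math.dist(x, y),
        )[:k]
        for _ in range(count):
            nb = rng.choice(neighbours)
            gap = rng.random()  # interpolation factor in [0, 1)
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Under this assumed weighting, an isolated minority sample receives more synthetic neighbours than samples in a dense cluster. A call such as `weighted_smote([(0, 0), (1, 0), (0, 1), (5, 5)], n_new=10)` returns ten interpolated points, with the largest share generated from `(5, 5)`.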

Key words: imbalanced data, Synthetic Minority Oversampling Technique (SMOTE), Kappa coefficient, random forest

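Abstract (Chinese version, translated):

To address the low recognition rate of minority-class samples caused by data imbalance, an algorithm is proposed that improves oversampling and random forest through weighting strategies, reducing the influence of data imbalance on the classifier at both the data preprocessing level and the algorithm level. In the data preprocessing stage, the Synthetic Minority Oversampling Technique (SMOTE) is applied to reduce the degree of imbalance; each minority-class sample is assigned a weight according to its Euclidean distance relative to the remaining samples, so that each sample synthesizes a different number of new samples. In the algorithm improvement stage, the Kappa coefficient is used to evaluate the classification performance of each trained decision tree in the random forest, and each tree is given a corresponding weight, so that trees with better classification ability have greater voting power in the voting stage, which improves the overall classification performance of the random forest on imbalanced data. Experiments on KEEL datasets show that, compared with the unimproved algorithm, the improved algorithm achieves higher classification accuracy on minority-class samples and better overall classification performance.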
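
The Kappa-weighted voting idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each tree is scored with Cohen's Kappa, κ = (p_o − p_e) / (1 − p_e), on a held-out validation set (the abstract does not say which data the trees are evaluated on), negative Kappa values are clipped to zero, and each tree is represented only by its list of predictions. The names `cohen_kappa`, `tree_weights`, and `weighted_vote` are hypothetical.

```python
from collections import Counter, defaultdict


def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa: agreement between labels, corrected for chance."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal class frequencies.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return 0.0  # assumption: treat the single-class degenerate case as 0
    return (p_o - p_e) / (1.0 - p_e)


def tree_weights(y_val, val_predictions):
    """One weight per tree from its validation predictions.

    Assumption: negative Kappa (worse than chance) is clipped to zero.
    """
    return [max(cohen_kappa(y_val, preds), 0.0) for preds in val_predictions]


def weighted_vote(weights, tree_votes):
    """Return the class whose supporting trees have the largest total weight."""
    scores = defaultdict(float)
    for w, vote in zip(weights, tree_votes):
        scores[vote] += w
    return max(scores, key=scores.get)
```

On an imbalanced validation set with labels `[0, 0, 0, 0, 1, 1]`, a tree that always predicts the majority class `0` is about 67% accurate but has a Kappa of 0, so it carries no weight. If one perfect tree votes `1` for a new sample and two such majority-only trees vote `0`, a plain majority vote returns `0`, whereas `weighted_vote` returns `1`.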

Key words (Chinese version, translated): imbalanced data, Synthetic Minority Oversampling Technique (SMOTE), Kappa coefficient, random forest