Improved Oversampling and Random Forest Algorithm for Imbalanced Data

doi:10.3778/j.issn.1002-8331.1908-0338

Abstract

To address the low recognition rate of minority-class samples caused by imbalanced data, this paper proposes an improved algorithm that applies a weighting strategy to both oversampling and random forest, reducing the influence of data imbalance on the classifier at the data preprocessing level and at the algorithm level. In the data preprocessing step, weighted oversampling based on the Synthetic Minority Oversampling Technique (SMOTE) is applied to reduce the imbalance ratio. Each minority-class sample is assigned a weight according to its Euclidean distance to the rest of the minority class, so that different samples generate different numbers of synthetic samples. To improve the random forest, the Kappa coefficient is used to evaluate the classification performance of each decision tree after training, and each tree is given a corresponding weight, so that better-performing trees have more voting power in the final voting stage. Experiments on KEEL datasets show that, compared with the unimproved algorithm, the proposed algorithm improves both the classification accuracy for minority samples and the overall classification performance on imbalanced datasets.

Keywords: imbalanced data, Synthetic Minority Oversampling Technique (SMOTE), Kappa coefficient, random forest
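
The two weighting ideas in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation, and the abstract does not give the exact formulas. The function names (`allocate_synthetic_counts`, `cohen_kappa`, `weighted_vote`) are hypothetical, the choice to make a sample's weight proportional to the sum of its distances to the other minority samples is an assumption, and so is the choice to give trees with a non-positive Kappa no vote.

```python
import math
from collections import defaultdict


def allocate_synthetic_counts(minority, total_new):
    """Split total_new synthetic samples across minority points by weight.

    Assumption: a point's weight is proportional to the sum of its Euclidean
    distances to the other minority points. The abstract only says that the
    weights are determined by these distances, not how.
    """
    dist_sums = [
        sum(math.dist(p, q) for j, q in enumerate(minority) if j != i)
        for i, p in enumerate(minority)
    ]
    total = sum(dist_sums)
    if total == 0:
        # All points coincide: fall back to equal weights.
        weights = [1.0 / len(minority)] * len(minority)
    else:
        weights = [d / total for d in dist_sums]
    # Rounding means the counts may not sum exactly to total_new.
    return [round(w * total_new) for w in weights]


def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa: agreement between labels and predictions beyond chance."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    if expected == 1.0:
        # Kappa is undefined here; returning 0.0 is a convention of this sketch.
        return 0.0
    return (observed - expected) / (1.0 - expected)


def weighted_vote(tree_predictions, tree_weights):
    """Return the label with the largest total weight over all trees.

    Assumption: negative weights (Kappa below zero) are clipped to zero.
    """
    scores = defaultdict(float)
    for label, weight in zip(tree_predictions, tree_weights):
        scores[label] += max(weight, 0.0)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    minority = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
    print(allocate_synthetic_counts(minority, total_new=20))
    print(cohen_kappa([1, 1, 0, 0], [1, 1, 0, 1]))
    print(weighted_vote(["minority", "majority", "majority"], [0.9, 0.3, 0.4]))
```

In this sketch, each tree's weight would be the Kappa value it obtains on held-out data, so a single tree with high agreement can outvote several weaker trees, which matches the behavior the abstract describes for the final voting stage.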

ZHANG Jiawei, GUO Linming, YANG Xiaomei. Improved Oversampling and Random Forest Algorithm for Imbalanced Data[J]. Computer Engineering and Applications, 2020, 56(11): 39-45.
