Oversampling Method for Unbalanced Data Sets Based on SVM

doi:10.3778/j.issn.1002-8331.2006-0449

Abstract

Classifiers trained on imbalanced data sets tend to be biased toward the majority class, and resampling is one of the effective ways to address this problem. However, traditional oversampling algorithms tend to synthesize invalid samples, and undersampling methods tend to discard important sample information. To address this, an Oversampling Method based on SVM (SVMOM) is proposed. SVMOM synthesizes samples iteratively. In each iteration, a classification hyperplane is first obtained with an SVM. Each minority sample is then assigned a distance weight according to its distance to the classification hyperplane. At the same time, to account for the intra-class balance of the minority class, the density of each sample is computed from the sample distribution and used to assign a density weight. The selection weight of each minority sample is then calculated from its distance weight and density weight. Finally, samples are selected according to their selection weights and SMOTE is used to synthesize new samples, thereby balancing the data set. Experimental results show that the proposed algorithm alleviates, to a certain extent, the bias of classification results toward the majority class, which verifies its effectiveness.

Key words: imbalanced data, Support Vector Machine (SVM), over-sampling, sample weight, Synthetic Minority Over-sampling Technique (SMOTE)
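
The abstract names the steps of one SVMOM iteration but does not give the weight formulas, so the following Python sketch is only an illustration under stated assumptions, not the authors' implementation. It assumes a linear hyperplane (w, b) has already been fitted by an SVM, takes the distance weight as 1 / (1 + distance), takes the density weight as the mean distance to the k nearest minority neighbours, and combines the two by a normalized product. All function names are hypothetical, and the refitting of the SVM between iterations is omitted.

```python
import math
import random


def _normalize(values):
    """Scale non-negative values to sum to 1 (uniform if they are all zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def distance_weights(minority, w, b):
    """ASSUMED form: 1 / (1 + distance to the hyperplane w.x + b = 0)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    dists = [abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
             for x in minority]
    return _normalize([1.0 / (1.0 + d) for d in dists])


def k_nearest(minority, i, k):
    """Indices of the (at most) k minority samples closest to sample i."""
    others = [j for j in range(len(minority)) if j != i]
    others.sort(key=lambda j: math.dist(minority[i], minority[j]))
    return others[:k]


def density_weights(minority, k):
    """ASSUMED form: mean distance to the k nearest minority neighbours,
    so samples in sparse regions receive a larger weight."""
    raw = []
    for i in range(len(minority)):
        nbrs = k_nearest(minority, i, k)
        raw.append(sum(math.dist(minority[i], minority[j]) for j in nbrs)
                   / len(nbrs))
    return _normalize(raw)


def selection_weights(minority, w, b, k):
    """ASSUMED combination: normalized product of the two weights."""
    dw = distance_weights(minority, w, b)
    rw = density_weights(minority, k)
    return _normalize([d * r for d, r in zip(dw, rw)])


def oversample_once(minority, w, b, n_new, k=3, seed=0):
    """One iteration: pick seeds by selection weight, then interpolate
    toward a random nearest minority neighbour (the SMOTE step)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    weights = selection_weights(minority, w, b, k)
    synthetic = []
    for _ in range(n_new):
        i = rng.choices(range(len(minority)), weights=weights)[0]
        j = rng.choice(k_nearest(minority, i, k))
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + u * (c - a)
                               for a, c in zip(minority[i], minority[j])))
    return synthetic


# Toy data: four 2-D minority samples and an assumed hyperplane x + y = 3.
minority = [(1.0, 1.0), (1.5, 1.2), (3.0, 3.0), (3.2, 2.8)]
w, b = (1.0, 1.0), -3.0
new_samples = oversample_once(minority, w, b, n_new=5)
```

Under these assumptions, a minority sample that is both close to the hyperplane and located in a sparse region is the most likely seed. Each synthetic sample lies on the segment between two existing minority samples, as in standard SMOTE.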

ZHANG Zhonglin, FENG Yibang, ZHAO Zhongkai. Oversampling Method for Unbalanced Data Sets Based on SVM[J]. Computer Engineering and Applications, 2020, 56(23): 220-228.
