Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (23): 220-228. DOI: 10.3778/j.issn.1002-8331.2006-0449

• Engineering and Applications •

Oversampling Method for Unbalanced Data Sets Based on SVM

ZHANG Zhonglin, FENG Yibang, ZHAO Zhongkai   

  1. School of Electronics and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
  • Online: 2020-12-01 Published: 2020-11-30

Abstract:

Classifiers trained on imbalanced data sets tend to produce results biased towards the majority class, and resampling is one effective way to address this problem. However, traditional oversampling algorithms tend to synthesize invalid samples, while undersampling methods tend to discard important sample information. To address this, an Oversampling Method based on SVM (SVMOM) is proposed. SVMOM synthesizes samples iteratively. In each iteration, a classification hyperplane is first obtained with an SVM. Each minority-class sample is then assigned a distance weight according to its distance to the classification hyperplane. To account for balance within the minority class, the density of each sample is computed from the sample distribution and used to assign a density weight. The distance weight and the density weight are then combined into a selection weight for each minority-class sample. Finally, samples are selected according to their selection weights and SMOTE is applied to them to synthesize new samples, thereby balancing the data set. Experimental results show that the proposed algorithm alleviates, to a certain extent, the bias of classification results towards the majority class, which verifies its effectiveness.
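
The abstract does not state the weight formulas, so the steps it lists can only be sketched under assumptions. In the minimal Python sketch below, everything concrete is hypothetical: the function names (`fit_linear_svm`, `svmom_sketch`), the parameters `k`, `batch` and `alpha`, the choice of giving larger distance weight to samples nearer the hyperplane, the choice of giving larger density weight to samples in sparser regions, and the linear mix of the two weights. A small hinge-loss linear model trained by subgradient descent stands in for a real SVM solver; it is not the authors' implementation.

```python
import math
import random

def fit_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Stand-in linear SVM: hinge loss minimised by Pegasos-style
    subgradient descent. Labels in y must be +1 or -1."""
    rng = random.Random(seed)
    w, b, t = [0.0] * len(X[0]), 0.0, 0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularisation shrink
            if margin < 1.0:  # sample violates the margin: hinge update
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b

def normalize(values):
    """Scale non-negative values to sum to 1 (uniform if all are zero)."""
    total = sum(values)
    if total <= 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]

def knn(points, i, k):
    """Indices of up to k nearest neighbours of points[i], excluding i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def svmom_sketch(X_maj, X_min, k=3, batch=5, alpha=0.5, seed=0):
    """Iteratively oversample the minority class until both classes have
    the same size. Requires at least two minority samples."""
    rng = random.Random(seed)
    minority = [list(p) for p in X_min]
    while len(minority) < len(X_maj):
        # 1. Fit the separating hyperplane on the current data.
        X = [list(p) for p in X_maj] + minority
        y = [-1] * len(X_maj) + [1] * len(minority)
        w, b = fit_linear_svm(X, y, seed=seed)
        norm = math.sqrt(sum(wj * wj for wj in w)) or 1e-12
        # 2. Distance weight (assumed): nearer the hyperplane = larger.
        dist = [abs(sum(wj * xj for wj, xj in zip(w, p)) + b) / norm
                for p in minority]
        dist_w = normalize([1.0 / (1.0 + d) for d in dist])
        # 3. Density weight (assumed): larger mean k-NN distance = larger.
        neigh = [knn(minority, i, k) for i in range(len(minority))]
        sparsity = [sum(math.dist(minority[i], minority[j]) for j in nb) / len(nb)
                    for i, nb in enumerate(neigh)]
        dens_w = normalize(sparsity)
        # 4. Selection weight (assumed): linear mix of the two weights.
        sel_w = normalize([alpha * a + (1.0 - alpha) * c
                           for a, c in zip(dist_w, dens_w)])
        # 5. SMOTE step: interpolate between a chosen sample and a neighbour.
        n_new = min(batch, len(X_maj) - len(minority))
        bases = rng.choices(range(len(minority)), weights=sel_w, k=n_new)
        new_points = []
        for i in bases:
            j = rng.choice(neigh[i])
            gap = rng.random()
            new_points.append([a + gap * (c - a)
                               for a, c in zip(minority[i], minority[j])])
        minority.extend(new_points)
    return minority
```

Calling `svmom_sketch(X_maj, X_min)` with two lists of feature vectors returns the original minority samples followed by the synthetic ones. Because each new point is an interpolation between two existing minority points, the synthetic samples stay inside the region spanned by the original minority class; how well they track the class boundary depends entirely on the assumed weight formulas.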

Key words: imbalanced data, Support Vector Machine (SVM), oversampling, sample weight, Synthetic Minority Over-sampling Technique (SMOTE)