Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (1): 46-52. DOI: 10.3778/j.issn.1002-8331.1901-0083

• Theory, Research and Development •

Imbalanced Data Classification Method Based on Boundary Mixed Resampling

HOU Beibei, LIU Sanyang, PU Shiye

  1. School of Mathematics and Statistics, Xidian University, Xi’an 710126, China
  • Online: 2020-01-01 Published: 2020-01-02

Abstract: For imbalanced data classification, an algorithm based on boundary mixed resampling is proposed, with the aim of synthesizing valuable new samples and deleting original samples that have no influence. The algorithm first introduces the concept of the support k-outlier degree to separate the data set into a boundary point set and a non-boundary point set. An improved SMOTE algorithm then takes the boundary points of the minority class as target samples for synthesizing new points, while the non-boundary points of the majority class are undersampled with a distance-based algorithm, bringing the classes into approximate balance. Experimental comparisons show that the proposed algorithm improves classification accuracy on the minority class to some extent while keeping the G-mean value favorable.

Key words: k-outlier, resampling, boundary points, imbalanced data classification
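
The resampling flow described in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the support k-outlier degree, the modification made to SMOTE, or the exact distance criterion, so everything below is an assumption rather than the authors' method: boundary points are detected with a simple stand-in test (at least half of a point's k nearest neighbours carry the other label), new minority points come from plain SMOTE-style interpolation, and non-boundary majority points are dropped in order of distance from the minority centroid. The names `knn_indices`, `boundary_mask`, and `boundary_mixed_resample` are hypothetical.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbours of every row (self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def boundary_mask(X, y, k=5):
    """ASSUMED stand-in for the paper's boundary test: a point counts as
    'boundary' if at least half of its k nearest neighbours have the other label."""
    nn = knn_indices(X, k)
    other_ratio = (y[nn] != y[:, None]).mean(axis=1)
    return other_ratio >= 0.5

def boundary_mixed_resample(X, y, minority=1, k=5, seed=0):
    """Oversample boundary minority points and undersample non-boundary
    majority points, each covering about half of the class-size gap."""
    rng = np.random.default_rng(seed)
    is_min = y == minority
    border = boundary_mask(X, y, k)
    X_min = X[is_min]
    seeds = X[is_min & border]              # boundary minority points
    gap = int((~is_min).sum() - is_min.sum())
    if gap <= 0 or len(seeds) == 0 or len(X_min) < 2:
        return X.copy(), y.copy()           # nothing sensible to do
    n_new = gap // 2                        # ASSUMED split of the gap
    n_drop = gap - n_new

    # SMOTE-style interpolation from each seed toward a minority neighbour.
    kk = min(k, len(X_min) - 1)
    d = np.linalg.norm(seeds[:, None, :] - X_min[None, :, :], axis=2)
    # Column 0 after sorting is the seed itself (or an exact duplicate).
    nn = np.argsort(d, axis=1)[:, 1:kk + 1]
    s = rng.integers(0, len(seeds), n_new)
    j = nn[s, rng.integers(0, kk, n_new)]
    lam = rng.random((n_new, 1))
    synth = seeds[s] + lam * (X_min[j] - seeds[s])

    # ASSUMED distance rule: drop the non-boundary majority points that lie
    # farthest from the minority centroid (fewer are dropped if too few exist).
    maj_idx = np.flatnonzero(~is_min)
    safe = maj_idx[~border[maj_idx]]
    dist = np.linalg.norm(X[safe] - X_min.mean(axis=0), axis=1)
    drop = safe[np.argsort(dist)[::-1][:n_drop]]
    keep = np.setdiff1d(np.arange(len(X)), drop)

    X_out = np.vstack([X[keep], synth])
    y_out = np.concatenate([y[keep], np.full(n_new, minority, dtype=y.dtype)])
    return X_out, y_out
```

Splitting the gap evenly between oversampling and undersampling is only one possible choice; the source states only that the two steps together bring the classes into balance.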