Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (14): 156-160. DOI: 10.3778/j.issn.1002-8331.1905-0071

• Pattern Recognition and Artificial Intelligence •

Oversampling Method for Unbalanced Data Based on Information of Characteristic Boundary

MENG Dongxia, LI Yujian

  1. Department of Information Management and Technology, Hebei Finance University, Baoding, Hebei 071051, China
  2. School of Computer, Information Department, Beijing University of Technology, Beijing 100124, China
  • Online: 2020-07-15  Published: 2020-07-14

Abstract:

To address the imbalanced class distributions that arise in practical datasets, this paper proposes an oversampling method that incorporates feature-boundary information. The method first removes noise points from the dataset. Then, based on the majority-class nearest-neighbor sets of the minority-class samples and the geometric distribution of the feature boundary, it selects the minority samples that help define an optimal nonlinear classification boundary and combines them with the clusters they belong to in order to generate new samples. Imbalanced datasets are processed with several oversampling techniques and then classified with a support vector machine (SVM); the comparative experiments show that the proposed method effectively improves classification accuracy on imbalanced data, which verifies the effectiveness of the algorithm.

Key words: imbalanced data set, classification, oversampling, characteristic boundary
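
The abstract describes the pipeline only at a high level, so the following Python sketch is an illustration under stated assumptions rather than the authors' algorithm. It assumes that (a) a minority point is noise when all of its k nearest neighbours belong to the majority class, (b) a minority point is a boundary point when at least half, but not all, of its neighbours belong to the majority class, and (c) the "cluster" a point belongs to can be approximated by the centroid of the cleaned minority class. The function names `knn_indices` and `boundary_oversample` and all parameters are hypothetical.

```python
import numpy as np


def knn_indices(X, k):
    """Indices of the k nearest neighbours of each row of X (self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is never its own neighbour
    return np.argsort(d, axis=1)[:, :k]


def boundary_oversample(X, y, minority=1, k=5, n_new=None, seed=0):
    """Illustrative boundary-aware oversampling (assumed rules, see lead-in)."""
    rng = np.random.default_rng(seed)
    nn = knn_indices(X, k)
    # Fraction of majority-class points among each sample's k neighbours.
    maj_frac = (y[nn] != minority).mean(axis=1)
    is_min = y == minority

    # Step 1 (assumed rule): minority points surrounded only by majority
    # points are treated as noise and removed.
    noise = is_min & (maj_frac == 1.0)

    # Step 2 (assumed rule): boundary points have a mixed neighbourhood
    # dominated by the majority class.
    boundary = is_min & (maj_frac >= 0.5) & ~noise

    clean_min = X[is_min & ~noise]
    # Fall back to all clean minority points if no boundary point is found.
    seeds = X[boundary] if boundary.any() else clean_min

    # Step 3 (assumed rule): interpolate from each seed toward the centroid
    # of the cleaned minority class, a stand-in for "its cluster".
    centroid = clean_min.mean(axis=0)
    if n_new is None:
        n_new = int((~is_min).sum() - len(clean_min))  # balance the classes
    n_new = max(n_new, 0)
    picks = seeds[rng.integers(0, len(seeds), size=n_new)]
    gaps = rng.random((n_new, 1))
    synthetic = picks + gaps * (centroid - picks)

    keep = ~noise
    X_out = np.vstack([X[keep], synthetic])
    y_out = np.concatenate([y[keep], np.full(n_new, minority, dtype=y.dtype)])
    return X_out, y_out
```

The resampled arrays could then be passed to any classifier, for example an SVM as in the paper's experiments. The sketch uses a brute-force distance matrix, which is adequate only for small datasets.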