Computer Engineering and Applications, 2021, Vol. 57, Issue (2): 91-96. DOI: 10.3778/j.issn.1002-8331.1910-0218

• Theory, Research and Development •

Oversampling Method for Unbalanced Data by Natural Nearest Neighbor

MENG Dongxia (孟东霞), LI Yujian (李玉鑑)

  1. School of Financial Technology, Hebei Finance University, Baoding, Hebei 071051, China
  2. School of Artificial Intelligence, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
  • Online: 2021-01-15  Published: 2021-01-14

Abstract:

Existing oversampling methods tend to introduce noise points and to generate overlapping synthetic samples. To address these problems, this paper proposes an oversampling method for imbalanced data based on natural nearest neighbors. The method first determines the natural nearest neighbors of each minority-class sample; the number of neighbors of each sample is computed adaptively by the algorithm and reflects how dense or sparse the sample distribution is around it. The minority-class samples are then clustered according to their natural-neighbor relations, and new samples are generated from core points in dense regions and non-core points in sparse regions of the same cluster. Comparative experiments on a two-dimensional synthetic dataset and on UCI datasets verify the feasibility and effectiveness of the method and show that it improves classification accuracy on imbalanced data.

Key words: imbalanced data set, oversampling, natural nearest neighbor, clustering
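
The adaptive neighbor search mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the commonly used natural-neighbor search, in which the neighborhood size r grows until every point has at least one reverse neighbor or the number of points without one stops changing, and it treats two points as natural neighbors when each lies among the other's r nearest points. The function names `natural_neighbors` and `interpolate` are hypothetical, and the paper's clustering step and its core/non-core selection rule are not reproduced here; `interpolate` only shows the usual SMOTE-style interpolation between two chosen minority samples.

```python
import numpy as np


def natural_neighbors(X):
    """Return (r, neighbors): the final search radius r and, for each point,
    the set of indices that are its mutual (natural) neighbors.

    Assumed stopping rule: stop when every point has a reverse neighbor,
    or when the count of points without one no longer changes.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances; a point is never its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    order = np.argsort(dist, axis=1)  # order[i, k] = (k+1)-th nearest point to i

    knn = [set() for _ in range(n)]  # the r nearest neighbors found so far
    rnn = [set() for _ in range(n)]  # reverse neighbors: who selected this point
    prev_orphans = -1
    r = 0
    while r < n - 1:
        for i in range(n):
            j = int(order[i, r])
            knn[i].add(j)
            rnn[j].add(i)
        r += 1
        orphans = sum(1 for s in rnn if not s)
        if orphans == 0 or orphans == prev_orphans:
            break
        prev_orphans = orphans

    # Natural neighbors are mutual: j is near i and i is near j.
    neighbors = [knn[i] & rnn[i] for i in range(n)]
    return r, neighbors


def interpolate(a, b, rng):
    """Synthetic sample at a random position on the segment from a to b."""
    gap = rng.random()
    return a + gap * (b - a)
```

In this sketch the size of `neighbors[i]` differs from point to point, which is the sense in which the neighbor count adapts to local density: points in dense regions tend to collect more mutual neighbors than isolated ones. Calling `interpolate(X[i], X[j], np.random.default_rng(0))` for a natural-neighbor pair `(i, j)` yields one synthetic sample between them.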