Computer Engineering and Applications ›› 2019, Vol. 55 ›› Issue (22): 40-45. DOI: 10.3778/j.issn.1002-8331.1809-0275

• Theory, Research and Development •

New Redundancy-Based Algorithm for Reducing the Number of Training Examples in KNN

WANG Ziqi, HE Jinwen, JIANG Liangxiao   

  1. School of Computer Science, China University of Geosciences, Wuhan 430074, China
  • Online: 2019-11-15    Published: 2019-11-13

Abstract: As one of the top 10 algorithms in data mining, the K-Nearest-Neighbor (KNN) algorithm is widely used because it is non-parametric, simple, and effective, and requires no training time. However, when KNN faces a massive set of high-dimensional training examples, its high classification time complexity becomes a bottleneck for practical application. In addition, its classification performance often suffers when the class distribution of the training examples is skewed, i.e., when the class-imbalance problem occurs. To address these two issues, this paper proposes a new redundancy-based algorithm for reducing the number of training examples in KNN (RBKNN for short). RBKNN introduces a preprocessing step over the training set: it computes the redundancy of each training example and then randomly deletes some of the highly redundant examples. This not only reduces the size of the training set but also makes the class distribution of the training examples more balanced. Experimental results show that RBKNN significantly improves the classification efficiency of KNN while maintaining or even improving its classification accuracy.
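The abstract only outlines the preprocessing step, so a minimal Python sketch of the idea follows. Note that the redundancy measure used here (the fraction of an example's k nearest neighbors that share its class label) and the parameters threshold and keep_prob are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def redundancy_scores(X, y, k=5):
    """Score each training example's redundancy.

    Assumed proxy (not the paper's exact definition): the fraction of an
    example's k nearest neighbors that share its class label. An example
    surrounded entirely by same-class neighbors adds little information
    to a KNN classifier.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances between all training examples.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)            # an example is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]      # indices of the k nearest neighbors
    return (y[nn] == y[:, None]).mean(axis=1)

def rbknn_reduce(X, y, k=5, threshold=0.8, keep_prob=0.5, seed=0):
    """Preprocessing step: randomly drop part of the high-redundancy examples."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    high = redundancy_scores(X, y, k) >= threshold
    # Keep every low-redundancy example; keep each high-redundancy one only
    # with probability keep_prob, so dense regions are thinned out at random.
    keep = ~high | (rng.random(len(y)) < keep_prob)
    return X[keep], y[keep]
```

Under this proxy, majority-class examples in dense regions tend to receive the highest redundancy scores, so randomly thinning the high-scoring examples both shrinks the training set and pushes the class distribution toward balance, matching the two goals stated in the abstract.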

Key words: KNN classifiers, example reduction, fast classification, class imbalance