基于冗余度的KNN训练样本裁剪新算法

doi:10.3778/j.issn.1002-8331.1809-0275

计算机工程与应用 ›› 2019, Vol. 55 ›› Issue (22): 40-45.DOI: 10.3778/j.issn.1002-8331.1809-0275

基于冗余度的KNN训练样本裁剪新算法

王子旗，何锦雯，蒋良孝

中国地质大学（武汉）计算机学院，武汉 430074

出版日期:2019-11-15 发布日期:2019-11-13

New Redundancy-Based Algorithm for Reducing Amount of Training Examples in KNN

WANG Ziqi, HE Jinwen, JIANG Liangxiao

School of Computer Science, China University of Geosciences, Wuhan 430074, China

Online:2019-11-15 Published:2019-11-13

摘要/Abstract

摘要： 作为数据挖掘领域十大算法之一，K-近邻算法（K-Nearest-Neighbor，KNN）因具有非参数、无需训练时间、简单有效等特点而得到广泛应用。然而，KNN算法在面对高维的大训练样本集时，分类时间复杂度高的问题成为其应用的瓶颈。另外，因训练样本的类分布不均匀而导致的类不平衡问题也会影响其分类性能。针对这两个问题，提出了一种基于冗余度的KNN分类器训练样本裁剪新算法（简记为RBKNN）。RBKNN通过引入训练样本集预处理过程，对每个训练样本进行冗余度计算，并随机裁剪掉部分高冗余度的训练样本，从而达到减小训练样本规模、均衡样本分布的目的。实验结果表明，RBKNN可在保持或改善分类精度的前提下显著提升KNN的分类效率。

关键词: KNN分类器, 样本裁剪, 快速分类, 类不平衡

Abstract: As one of the top 10 algorithms in data mining, the K-Nearest-Neighbor（KNN） algorithm is widely used because it is an non-parametric, simple and effective algorithm without training time. However, when it faces to massive amount of high-dimensional training examples, its high classification time complexity becomes a bottleneck of its application. In addition, its classification performance is often harmed, when the class distribution of training examples is skewed and the class imbalance problem occurs. To address these two issues, this paper proposes a new redundancy-based algorithm for reducing the amount of training examples （simply RBKNN）. RBKNN at first computes the redundancy of each training example, and then randomly deletes some high redundant training examples by introducing a pre-processing process. RBKNN can not only reduce the size of training example set, but also make the class distribution of training examples more balanced. The experimental results show that RBKNN significantly promotes the efficiency of KNN, yet at the same time maintains or improves the classification accuracy of KNN.

Key words: KNN classifiers, example reduction, fast classification, class imbalance

王子旗，何锦雯，蒋良孝. 基于冗余度的KNN训练样本裁剪新算法[J]. 计算机工程与应用, 2019, 55(22): 40-45.

WANG Ziqi, HE Jinwen, JIANG Liangxiao. New Redundancy-Based Algorithm for Reducing Amount of Training Examples in KNN[J]. Computer Engineering and Applications, 2019, 55(22): 40-45.

[1]	易成岐1，黄倩倩1，王从余2，张何灿3，靳晓锟4，王建冬1. 面向类不平衡问题的“职业举报人”识别方法[J]. 计算机工程与应用, 2019, 55(14): 1-7.
[2]	廖胜平，徐玲，鄢萌. 基于采样的半监督支持向量机软件缺陷预测方法[J]. 计算机工程与应用, 2017, 53(14): 161-166.
[3]	乔健，田庆. 利用最近邻信息快速分类多标签数据[J]. 计算机工程与应用, 2011, 47(32): 138-140.
[4]	张珍珍，董才林，陈增照，何秀玲. 改进的结合密度聚类的SVM快速分类方法[J]. 计算机工程与应用, 2011, 47(2): 136-138.
[5]	芮瑞，鲍长春. 噪声环境下的窄带音频信号快速分类方法[J]. 计算机工程与应用, 2011, 47(16): 22-25.
[6]	杨林波¹,王士同^1,2. 基于边界可信度相似的快速文本分类方法[J]. 计算机工程与应用, 2009, 45(4): 156-158.
[7]	张磊李弼程刘安斐. 基于多特征和证据理论离线签名识别[J]. 计算机工程与应用, 2007, 43(8期): 234-237.

基于冗余度的KNN训练样本裁剪新算法

New Redundancy-Based Algorithm for Reducing Amount of Training Examples in KNN

PDF

可视化

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 7

编辑推荐

Metrics