Computer Engineering and Applications ›› 2018, Vol. 54 ›› Issue (20): 132-138. DOI: 10.3778/j.issn.1002-8331.1706-0340

• Pattern Recognition and Artificial Intelligence •


Integrated self-training method based on neighborhood density and semi-supervised KNN

LI Junnan, LV Jia   

  1. College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
  • Online: 2018-10-15  Published: 2018-10-19


Abstract: The integrated self-training algorithm is prone to local overfitting during iteration when the initial labeled samples are chosen at random, so it generalizes poorly to the original structure of the sample space. Moreover, when the integrated self-training algorithm uses a WKNN classifier for data editing, it does not take into account the influence of unlabeled samples on the class labels of test samples. To address these problems, an integrated self-training algorithm combining nearest neighbor density and semi-supervised KNN is proposed. The algorithm uses nearest neighbor density to select the initial labeled samples: the k nearest neighbors of each chosen sample are excluded from the labeled candidate set, so that the initial labeled samples are spread out and better reflect the original structure of the sample space, and among the remaining candidates the sample with the highest density is chosen as the next labeled sample. To improve the performance of data editing, semi-supervised KNN replaces WKNN, remedying the fact that WKNN considers only labeled samples when deciding the class of a test sample and ignores the unlabeled samples around it. Comparative experiments on UCI datasets verify the effectiveness of the proposed algorithm.
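The density-based initialization described in the abstract can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' implementation: the density measure (inverse of the mean distance to the k nearest neighbors) and all function names are assumptions.

```python
import numpy as np

def select_initial_labeled(X, n_select, k=5):
    """Pick initial labeled samples by nearest-neighbor density.

    Density of a point is taken here as the inverse of the mean
    distance to its k nearest neighbors (an assumed measure). After a
    point is selected, it and its k nearest neighbors are removed from
    the candidate set, so the chosen points stay spread out over the
    sample space rather than clustering together.
    """
    n = len(X)
    # pairwise Euclidean distance matrix, shape (n, n)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # indices of the k nearest neighbors of each point (excluding itself)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    # higher density = smaller mean distance to the k nearest neighbors
    density = 1.0 / (d[np.arange(n)[:, None], nn].mean(axis=1) + 1e-12)

    candidates = set(range(n))
    chosen = []
    while candidates and len(chosen) < n_select:
        # highest-density remaining candidate
        i = max(candidates, key=lambda j: density[j])
        chosen.append(i)
        # drop the point and its k nearest neighbors from the candidates
        candidates -= {i, *nn[i]}
    return chosen
```

Removing each pick's neighborhood from the candidate set is what keeps the initial labeled set dispersed, which is the property the abstract credits for better reflecting the original sample-space structure.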

Key words: integrated self-training, nearest neighbor density, semi-supervised, K-Nearest Neighbor (KNN)
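As a rough illustration of the data-editing idea, the semi-supervised KNN below classifies a sample using both its labeled neighbors and the unlabeled neighbors around it, which is exactly what plain WKNN ignores. The specific scheme (unlabeled neighbors receive a tentative label from their own labeled k-NN and then vote with a reduced weight) and all names are assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np
from collections import Counter

def knn_label(x, X_lab, y_lab, k):
    """Plain majority vote over the k nearest labeled neighbors."""
    d = np.linalg.norm(X_lab - x, axis=1)
    idx = np.argsort(d)[:k]
    return Counter(y_lab[idx]).most_common(1)[0][0]

def semi_supervised_knn(x, X_lab, y_lab, X_unl, k=5, unl_weight=0.5):
    """Classify x using both labeled and unlabeled neighbors.

    Unlabeled neighbors first receive a tentative label from their own
    labeled k-NN and then vote with a reduced weight (an assumed
    down-weighting scheme), so nearby unlabeled samples still influence
    the decision instead of being discarded as in plain WKNN.
    """
    X_all = np.vstack([X_lab, X_unl])
    d = np.linalg.norm(X_all - x, axis=1)
    idx = np.argsort(d)[:k]
    n_lab = len(X_lab)
    votes = {}
    for i in idx:
        if i < n_lab:
            # labeled neighbor: full-weight vote with its true label
            votes[y_lab[i]] = votes.get(y_lab[i], 0.0) + 1.0
        else:
            # unlabeled neighbor: tentative label, down-weighted vote
            lbl = knn_label(X_all[i], X_lab, y_lab, k)
            votes[lbl] = votes.get(lbl, 0.0) + unl_weight
    return max(votes, key=votes.get)
```

In a data-editing step, a self-labeled sample whose semi-supervised KNN decision disagrees with its assigned label would be treated as noise and removed before retraining.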