Computer Engineering and Applications ›› 2015, Vol. 51 ›› Issue (21): 144-149.

• Database, Data Mining, Machine Learning •


Probability-based two-level nearest neighbor classification algorithm with adaptive distance metric

TONG Bobing, WANG Shitong   

  1. School of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online: 2015-11-01  Published: 2015-11-16


Abstract: With finite samples, the performance of KNN depends heavily on the choice of distance metric, and previous work on distance metric learning does not fully consider the distribution of the samples. This paper proposes a new probability-based two-level nearest neighbor classification algorithm with adaptive metrics (PTLNN). The algorithm operates at two levels: at the low level, Euclidean distance is used to determine the local subspace around an unlabeled sample; at the high level, AdaBoost is applied within that subspace to extract information. Under the principle of minimizing the mean absolute error, a probability-based adaptive distance metric is then defined for nearest neighbor classification. PTLNN combines the advantages of KNN and AdaBoost: by fully considering the sample distribution under finite samples it reduces the classification error rate, and it remains stable on noisy data, which mitigates AdaBoost's tendency to overfit. Comparative experiments against other algorithms show that PTLNN achieves better results.
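The abstract only outlines the two-level procedure, so the following is a minimal sketch of how such a classifier could be organized. It is written against assumptions, not the authors' implementation: the class name PTLNNSketch, the parameters k and n_estimators, and the use of scikit-learn's AdaBoostClassifier are all illustrative choices, and the paper's MAE-minimizing probability-based metric is approximated here by simply reweighting each neighbor's Euclidean distance with the boosted class probabilities.

    # Hypothetical sketch of the two-level idea described in the abstract.
    # Not the authors' code: PTLNNSketch, k, n_estimators, and the
    # probability-weighted distance below are illustrative assumptions.
    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    class PTLNNSketch:
        def __init__(self, k=25, n_estimators=50):
            self.k = k                      # size of the low-level local subspace
            self.n_estimators = n_estimators

        def fit(self, X, y):
            self.X = np.asarray(X, dtype=float)
            self.y = np.asarray(y)
            return self

        def predict_one(self, x):
            x = np.asarray(x, dtype=float)
            # Low level: Euclidean distance selects the local subspace.
            d = np.linalg.norm(self.X - x, axis=1)
            idx = np.argsort(d)[: self.k]
            X_loc, y_loc, d_loc = self.X[idx], self.y[idx], d[idx]
            if len(np.unique(y_loc)) == 1:  # pure neighborhood: nothing to boost
                return y_loc[0]
            # High level: AdaBoost extracts class-probability information
            # from the local subspace only.
            ada = AdaBoostClassifier(n_estimators=self.n_estimators)
            ada.fit(X_loc, y_loc)
            proba = ada.predict_proba(x.reshape(1, -1))[0]
            # Stand-in for the probability-based adaptive metric: shrink the
            # Euclidean distance of neighbors whose class the boosted model
            # considers likely, then take the nearest neighbor under the
            # reweighted distances.
            cls_index = {c: i for i, c in enumerate(ada.classes_)}
            w = np.array([1.0 - proba[cls_index[c]] for c in y_loc])
            return y_loc[np.argmin(w * (d_loc + 1e-12))]

        def predict(self, X):
            return np.array([self.predict_one(x) for x in X])

Under these assumptions, a call such as PTLNNSketch(k=25).fit(X_train, y_train).predict(X_test) classifies each test point by boosting only within its local neighborhood, which is where the abstract locates the method's robustness to noise.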

Key words: two-level classification, metric learning, probability-based, AdaBoost, mean absolute error