Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (2): 70-76. DOI: 10.3778/j.issn.1002-8331.1912-0357

• Theory, Research and Development •

Self-Training Algorithm Combining Density Peak and Cut Edge Weight

WEI Danni, YANG Youlong, QIU Haiquan

  1. School of Mathematics and Statistics, Xidian University, Xi’an 710071, China
  2. College of Information & Network Engineering, Anhui Science and Technology University, Bengbu, Anhui 233030, China
  • Online: 2021-01-15 Published: 2021-01-14

Abstract:

To reduce the impact of mislabeled samples on the performance of self-training during iteration, a self-training algorithm based on density peaks and cut edge weight is proposed. First, a density-based clustering method is used to discover the spatial structure of the dataset, and representative unlabeled samples are selected for label prediction. Second, the cut edge weight is used as the statistic in a hypothesis test to judge whether each sample has been labeled correctly. The correctly labeled samples are then added to the labeled set step by step, until labels have been predicted for all unlabeled samples. The proposed algorithm makes full use of the spatial structure of the sample data and addresses the problem of some samples being mislabeled, which improves classification accuracy. Experiments on real datasets verify the effectiveness of the proposed algorithm.

Key words: self-training, density peak, cut edge weight, hypothesis testing
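
The abstract gives no formulas, so the following is only a minimal sketch of the two ingredients it names, not the authors' implementation. It assumes the usual density peak quantities with a cutoff kernel (local density `rho` and distance to the nearest denser point `delta`) and a normal-approximation z-score for the cut edge weight on a k-nearest-neighbour graph. The function names, the cutoff `d_c`, the neighbour count `k`, and the edge weight `1 / (1 + distance)` are all illustrative assumptions.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def density_peak_quantities(X, d_c):
    """Return (rho, delta) for each sample, using a cutoff kernel.

    rho[i]   : number of other samples closer than d_c (assumed cutoff).
    delta[i] : distance to the nearest sample with higher rho; samples with
               no denser sample get their largest distance instead.
    """
    D = pairwise_distances(X)
    rho = (D < d_c).sum(axis=1) - 1  # subtract 1 to exclude the sample itself
    delta = np.empty(len(X))
    for i in range(len(X)):
        denser = np.where(rho > rho[i])[0]
        delta[i] = D[i, denser].min() if denser.size else D[i].max()
    return rho, delta


def cut_edge_z_scores(X, y, k=3):
    """z-score of each sample's cut edge weight on a k-NN graph.

    An edge (i, j) is "cut" when y[i] != y[j]. The null hypothesis assumed
    here is that neighbour labels are drawn independently from the class
    proportions, so a large positive z-score marks a suspicious label.
    Requires at least two classes in y (otherwise the variance is zero).
    """
    D = pairwise_distances(X)
    np.fill_diagonal(D, np.inf)  # a sample is not its own neighbour
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / len(y)))
    z = np.empty(len(X))
    for i in range(len(X)):
        nbrs = np.argsort(D[i])[:k]
        w = 1.0 / (1.0 + D[i, nbrs])      # assumed edge weights
        J = w[y[nbrs] != y[i]].sum()      # observed cut edge weight
        p_cut = 1.0 - prior[y[i]]         # P(neighbour carries another label)
        mean = p_cut * w.sum()
        var = p_cut * (1.0 - p_cut) * (w ** 2).sum()
        z[i] = (J - mean) / np.sqrt(var)
    return z
```

In a self-training loop of the kind the abstract describes, newly labeled samples whose z-score exceeds a chosen critical value (for instance 1.645 for a one-sided 5% test under the normal approximation) would be treated as suspect and kept out of the labeled set for that iteration.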