Computer Engineering and Applications, 2018, Vol. 54, Issue (7): 36-43. DOI: 10.3778/j.issn.1002-8331.1801-0013


Improved density peaks clustering algorithm combining K-Nearest Neighbors

XUE Xiaona1, GAO Shuping1, PENG Hongming2, WU Huihui1   

  1. School of Mathematics and Statistics, Xidian University, Xi’an 710126, China
  2. School of Telecommunications Engineering, Xidian University, Xi’an 710071, China
  • Online: 2018-04-01  Published: 2018-04-16




Abstract: The Density Peaks Clustering (DPC) algorithm performs poorly on datasets that are high-dimensional, noisy, or structurally complex. To address this, an Improved Density Peaks Clustering Algorithm (IDPCA) combining K-Nearest Neighbors is proposed. First, a new definition of local density is given to describe how each sample is distributed in space. Second, the concept of a core point is introduced, and a global search allocation strategy based on K-Nearest Neighbors is designed; by repeatedly assigning the unassigned K-Nearest Neighbors of core points to the correct cluster, it speeds up clustering. Third, a statistical learning allocation strategy is developed that uses the weighted K-Nearest Neighbor information of the remaining unassigned points to compute the probability of each point belonging to each local cluster, which improves clustering quality. Finally, IDPCA is compared with DPC and three other classical clustering methods on 21 test datasets, both synthetic and real-world; the experimental results show that IDPCA outperforms them on four evaluation indexes.
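
For orientation, the following is a minimal sketch of the baseline density peaks idea that IDPCA builds on: compute a local density for each sample, compute each sample's distance to its nearest higher-density sample, pick the samples where both are large as cluster centers, and let every other sample inherit the label of its nearest higher-density neighbor. It is not the IDPCA method itself. The abstract does not give the paper's density formula, so the K-Nearest-Neighbor density used here (the exponential of the negative mean squared distance to the k nearest neighbors) is an assumption, and the function name `dpc_knn_sketch` and its parameters are hypothetical. The core-point search and the weighted-probability allocation strategies are not reproduced.

```python
import numpy as np


def dpc_knn_sketch(X, n_clusters, k=5):
    """Baseline density-peaks clustering with an assumed k-NN density.

    Illustrative only: the density formula is an assumption, not the
    definition proposed in the paper.
    """
    n = X.shape[0]
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Assumed local density: larger when the k nearest neighbors are close.
    knn_dist = np.sort(dist, axis=1)[:, 1:k + 1]  # column 0 is the point itself
    rho = np.exp(-np.mean(knn_dist ** 2, axis=1))

    # delta[i]: distance to the nearest sample with higher density.
    # Ties in rho are broken by position in `order`.
    order = np.argsort(-rho, kind="stable")  # indices by descending density
    delta = np.zeros(n)
    parent = np.full(n, -1)
    delta[order[0]] = dist[order[0]].max()  # convention for the global peak
    for pos in range(1, n):
        i = order[pos]
        higher = order[:pos]
        j = higher[np.argmin(dist[i, higher])]
        delta[i] = dist[i, j]
        parent[i] = j

    # Centers are the samples with the largest rho * delta.
    gamma = rho * delta
    gamma[order[0]] = np.inf  # the global density peak is always a center
    centers = np.argsort(-gamma, kind="stable")[:n_clusters]

    # Each remaining sample takes the label of its nearest higher-density
    # neighbor; descending-density order guarantees that label already exists.
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers


# Usage on two synthetic, well-separated blobs (illustrative data only).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
                    rng.normal(5.0, 0.3, size=(30, 2))])
labels_demo, centers_demo = dpc_knn_sketch(X_demo, n_clusters=2, k=5)
```

The single-pass assignment in the last loop is the step the abstract's two allocation strategies are designed to replace: one early mistake along the chain of higher-density neighbors propagates to every sample that depends on it.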

Key words: data mining, clustering algorithm, local density, density peaks, K-Nearest Neighbors
