Computer Engineering and Applications, 2024, Vol. 60, Issue (12): 74-90. DOI: 10.3778/j.issn.1002-8331.2305-0502

• Theory, Research and Development •

Density Peaks Clustering Algorithm Based on Shared Neighbor Degree and Probability Assignment

ZHU Hongxiang, WU Genxiu, WANG Zhaohui   

  1. School of Mathematics and Statistics, Jiangxi Normal University, Nanchang 330022, China
  • Online: 2024-06-15  Published: 2024-06-14

基于共享邻近度和概率分配的密度峰值聚类算法

朱鸿祥,吴根秀,王兆辉   

  1. 江西师范大学 数学与统计学院,南昌 330022

Abstract: The density peaks clustering (DPC) algorithm has difficulty accurately finding the cluster centers of manifold data, and its allocation of the remaining samples is prone to chain errors, in which one misallocated sample causes the samples allocated after it to be misallocated as well. To address these issues, this paper proposes a density peaks clustering algorithm based on shared neighbor degree and probability assignment (SP-DPC). First, the shared neighbor degree between sample points is defined from their K-nearest neighbors and shared K-nearest neighbors, and the local density of each sample point is redefined with it so that the correct cluster centers can be identified. Next, a transfer probability assignment strategy and an evidence probability assignment strategy, both based on the K-nearest neighbors of the sample points, are proposed to jointly optimize the allocation of the remaining samples and thereby avoid chain allocation errors. Finally, SP-DPC is compared with the DPC, SKM-DPC, DPC-NN, DBSCAN, and K-means algorithms on 17 synthetic datasets and 12 UCI datasets. The experimental results show that, overall, SP-DPC achieves the best values among the compared algorithms on the three evaluation indexes AMI, ARI, and FMI, and that its clustering performance is better than that of the other algorithms.

Key words: density peaks clustering, K-nearest neighbor, shared neighbor degree, probability assignment, evidence theory

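Abstract (Chinese version): To address the difficulty the density peaks clustering (DPC) algorithm has in accurately finding the cluster centers of manifold data, and the chain errors that readily occur when it allocates the remaining sample points, a density peaks clustering algorithm based on shared neighbor degree and probability assignment (SP-DPC) is proposed. The shared neighbor degree between sample points is defined on the basis of K-nearest neighbors and shared K-nearest neighbors, and the local density of the sample points is redefined with the shared neighbor degree so as to find the correct cluster centers. Using the K-nearest-neighbor information of the sample points, a transfer probability assignment strategy and an evidence probability assignment strategy are proposed to jointly optimize the allocation of the remaining sample points and thus avoid chain allocation errors. In experiments on 17 synthetic datasets and 12 UCI datasets, SP-DPC is compared with the DPC, SKM-DPC, DPC-NN, DBSCAN, and K-means algorithms; the results show that SP-DPC achieves relatively optimal values overall on the three evaluation indexes AMI, ARI, and FMI, and that its clustering performance is better than that of the other compared algorithms.

Key words (Chinese version): density peaks clustering, K-nearest neighbor, shared neighbor degree, probability assignment, evidence theory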
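
The abstract does not give the formulas for the shared neighbor degree or for the two probability assignment strategies, so the following is only a rough sketch of the baseline that SP-DPC modifies, not of SP-DPC itself. It assumes the usual DPC procedure with a Gaussian kernel density: each point gets a local density and a distance to its nearest denser point, the points with the largest product of the two become centers, and every remaining point takes the label of its nearest denser point. That last step is where chain errors arise, since one wrong label is inherited by every point that follows it. The helper `shared_neighbor_count` is a plain count of common K-nearest neighbors, included only to illustrate the kind of neighborhood information the abstract refers to. All function names, the cutoff `d_c`, and the parameter `k` are illustrative assumptions.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix as a list of lists."""
    return [[math.dist(p, q) for q in points] for p in points]


def knn_sets(dist, k):
    """For each point, the index set of its k nearest neighbors (self excluded)."""
    n = len(dist)
    result = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        result.append(set(others[:k]))
    return result


def shared_neighbor_count(knn, i, j):
    """Plain count of neighbors common to i and j.

    Assumption: this is NOT the paper's shared neighbor degree, whose
    definition is not stated in the abstract.
    """
    return len(knn[i] & knn[j])


def baseline_dpc(points, n_clusters, d_c):
    """Baseline DPC with a Gaussian kernel; d_c is the cutoff distance."""
    dist = pairwise_distances(points)
    n = len(points)

    # Local density: Gaussian-weighted count of nearby points.
    rho = [
        sum(math.exp(-((dist[i][j] / d_c) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]

    order = sorted(range(n), key=lambda i: -rho[i])  # densest first
    delta = [0.0] * n   # distance to the nearest denser point
    parent = [-1] * n   # index of that nearest denser point
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbor; use its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda h: dist[i][h])
        delta[i], parent[i] = dist[i][j], j

    # Centers: largest rho * delta (stable sort keeps the densest point first on ties).
    centers = sorted(order, key=lambda i: -rho[i] * delta[i])[:n_clusters]

    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    # Remaining points inherit the label of their nearest denser point.
    # A single wrong label here propagates to every point that follows it.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers


# Two small, well-separated groups of five points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels, centers = baseline_dpc(points, n_clusters=2, d_c=1.0)
knn = knn_sets(pairwise_distances(points), k=3)
print(labels, centers, shared_neighbor_count(knn, 0, 3))
```

On well-separated data like this the baseline is adequate. The problems described in the abstract appear on manifold-shaped clusters, where a distance-only density can place centers poorly and the inherit-from-nearest-denser-point rule can carry a wrong label along a whole chain of points.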