Density Peaks Clustering Algorithm Based on Shared Neighbor Degree and Probability Assignment

doi:10.3778/j.issn.1002-8331.2305-0502

Abstract

The density peaks clustering (DPC) algorithm has two notable weaknesses: it struggles to locate cluster centers accurately on manifold data, and when it allocates the remaining samples, one misassigned sample can cause a chain of further misassignments. To address these problems, this paper proposes a density peaks clustering algorithm based on shared neighbor degree and probability assignment (SP-DPC). First, the shared neighbor degree between sample points is defined from their K-nearest neighbors and shared K-nearest neighbors, and the local density of each sample point is redefined in terms of the shared neighbor degree so that the correct cluster centers can be identified. Next, a transfer probability assignment strategy and an evidence probability assignment strategy are proposed; both use the K-nearest neighbor information of the sample points, and together they optimize the allocation of the remaining samples and avoid chained allocation errors. Finally, SP-DPC is compared with DPC, SKM-DPC, DPC-NN, DBSCAN, and K-means on 17 synthetic datasets and 12 UCI datasets. The experimental results show that, taken over all datasets, SP-DPC achieves the best values among the compared algorithms on the three evaluation indexes AMI, ARI, and FMI, and that its clustering quality is better than that of the other algorithms.

Keywords: density peaks clustering, K-nearest neighbor, shared neighbor degree, probability assignment, evidence theory
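
The abstract names the building blocks of SP-DPC but does not give its formulas, so the following is only a minimal sketch of the two standard ingredients it starts from: the K-nearest-neighbor and shared K-nearest-neighbor sets of a pair of samples, and the baseline DPC quantities (local density rho and distance delta to the nearest denser sample). All function names, the Gaussian-kernel form of rho, and the cutoff `dc` are assumptions made for illustration. The shared neighbor degree, the redefined density, and the two probability assignment strategies of SP-DPC are not reproduced here.

```python
import math

def pairwise_distances(points):
    """Euclidean distance matrix as a list of lists."""
    return [[math.dist(p, q) for q in points] for p in points]

def knn_sets(dist, k):
    """K-nearest-neighbor set of every sample (the sample itself excluded)."""
    n = len(dist)
    return [
        set(sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k])
        for i in range(n)
    ]

def shared_knn(nbrs, i, j):
    """Shared K-nearest neighbors: samples found in both neighbor sets."""
    return nbrs[i] & nbrs[j]

def dpc_rho_delta(dist, dc):
    """Baseline DPC quantities (Gaussian-kernel density; an assumed variant).

    rho[i]   : sum of exp(-(d_ij / dc)^2) over all other samples j
    delta[i] : distance to the nearest sample with higher rho; for a sample
               with no denser sample, the largest distance from it instead
    """
    n = len(dist)
    rho = [
        sum(math.exp(-(dist[i][j] / dc) ** 2) for j in range(n) if j != i)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(denser) if denser else max(dist[i]))
    return rho, delta

# Toy data: two well-separated groups, each a unit square plus its center.
group = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
points = group + [(x + 10.0, y + 10.0) for x, y in group]

dist = pairwise_distances(points)
nbrs = knn_sets(dist, k=4)
rho, delta = dpc_rho_delta(dist, dc=1.0)

# In baseline DPC, samples with large rho * delta are candidate cluster centers.
gamma = [r * d for r, d in zip(rho, delta)]
centers = sorted(range(len(points)), key=lambda i: gamma[i])[-2:]
```

On this toy data the two candidate centers are the middle points of the two groups, and two corners of the same square share three of their four nearest neighbors while samples from different groups share none. Plain Python lists are used to keep the sketch dependency-free; an array library would be the usual choice for real datasets.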


ZHU Hongxiang, WU Genxiu, WANG Zhaohui. Density Peaks Clustering Algorithm Based on Shared Neighbor Degree and Probability Assignment[J]. Computer Engineering and Applications, 2024, 60(12): 74-90.

