k-NN Density Dominator Component Delegations Based Density Peaks Clustering

doi:10.3778/j.issn.1002-8331.2302-0037

Abstract

Abstract: Density peaks clustering (DPC, clustering by fast search and find of density peaks) is inefficient on large-scale data. The k-NN density dominator component acceleration technique mitigates this shortcoming, but the representative points it selects can represent their components poorly, which lowers clustering quality. A delegation sampling strategy is proposed to address this problem. The resulting algorithm inherits the efficiency of the density dominator component acceleration technique while preserving clustering quality. The algorithm first constructs a k-nearest neighbor graph. Second, it uses the graph to estimate kernel density and build a set of density dominator components. Third, it samples a delegation for each component from both its high-density and low-density regions and computes the similarity between components from the nearest-neighbor relationships among delegations. Finally, it runs DPC with each dominator component treated as a data point. Experiments show that the delegation strategy improves on the original DPC and that the clustering results are better than those of several other density-based clustering algorithms.
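The first two steps (k-nearest neighbor graph, then density estimation and dominator components) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the kernel or the dominance rule, so the Gaussian kernel over k-NN distances, the `bandwidth` parameter, the rule "link each point to its nearest denser neighbor within its k-NN list", and all function names are hypothetical choices made for this example.

```python
import math


def knn_graph(points, k):
    """Brute-force k-NN lists: for each point, its k nearest other points
    as (distance, index) pairs sorted by ascending distance."""
    graph = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph.append(dists[:k])
    return graph


def knn_density(graph, bandwidth):
    """Assumed estimator: sum of Gaussian kernel values over each point's
    k nearest neighbors only (not over the whole data set)."""
    return [sum(math.exp(-(d / bandwidth) ** 2) for d, _ in nbrs) for nbrs in graph]


def dominator_components(graph, density):
    """Assumed dominance rule: each point is dominated by its nearest k-NN
    neighbor of higher density. A point with no denser neighbor is a root,
    and every point is labeled with the root that it eventually reaches."""
    n = len(graph)
    parent = list(range(n))
    for i, nbrs in enumerate(graph):
        for _, j in nbrs:  # neighbors are already sorted by distance
            # Ties in density are broken by index so the links cannot form a cycle.
            if (density[j], j) > (density[i], i):
                parent[i] = j
                break

    def root(i):
        while parent[i] != i:
            i = parent[i]
        return i

    return [root(i) for i in range(n)]
```

Calling `dominator_components(g, knn_density(g, bandwidth))` with `g = knn_graph(points, k)` returns one root index per point; points sharing a root form one component. The brute-force graph construction costs quadratic time and is only meant to keep the sketch self-contained.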

Keywords: density peaks clustering, k-nearest neighbor graph, density dominator component, delegation strategy, large-scale clustering

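Abstract: The density peaks clustering (DPC, clustering by fast search and find of density peaks) algorithm is inefficient on large-scale clustering tasks. The k-nearest neighbor density dominator component small-cluster acceleration technique alleviates this weakness well, but its representative points have limited representational power, which degrades clustering quality. A delegation sampling strategy can remedy this problem. The resulting algorithm not only inherits the efficiency of the original density dominator component small-cluster acceleration technique but also preserves clustering quality. The algorithm first builds a k-nearest neighbor graph. It then uses the graph to perform kernel density estimation and construct several density dominator components. For each density dominator component, a delegation is sampled from both its high-density and low-density regions. The nearest-neighbor relationships among delegations are used to compute the similarity between components. Finally, each dominator component is treated as a new sample point and DPC is run to complete the clustering. Experiments show that introducing the delegation strategy yields some improvement over DPC and that clustering quality is better than that of several density-based clustering algorithms.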
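The final step runs DPC over the dominator components. For orientation, the sketch below shows plain point-level DPC in Python: local density, distance to the nearest denser point, center selection, and label propagation. It is a hedged illustration only. The Gaussian kernel, the cutoff `d_c`, choosing centers by the product of density and distance, and the function name `dpc` are assumptions for this example; in the proposed algorithm the inputs would be dominator components compared through delegation-based similarity, which the abstract does not specify.

```python
import math


def dpc(points, d_c, n_clusters):
    """Minimal density peaks clustering on raw points (illustrative sketch)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density rho: Gaussian kernel of width d_c over all other points.
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]

    # Process points from densest to sparsest (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # delta: distance to the nearest point that comes earlier in `order`.
    delta = [0.0] * n
    nearest_denser = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # convention for the densest point
            continue
        j = min(order[:rank], key=lambda h: dist[i][h])
        delta[i], nearest_denser[i] = dist[i][j], j

    # Centers: the n_clusters points with the largest rho * delta.
    centers = sorted(order, key=lambda i: -rho[i] * delta[i])[:n_clusters]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c

    # Every other point inherits the label of its nearest denser point,
    # which has already been labeled because we walk in density order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels
```

The full pairwise distance matrix in this sketch is the quadratic cost that makes point-level DPC slow on large data sets; running the same procedure over a much smaller set of dominator components is what the acceleration technique relies on.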

Keywords: density peaks clustering, k-nearest neighbor graph, density dominator component, delegation strategy, large-scale clustering

LYU Hongzhang, YANG Yiyang, YANG Geping, GONG Zhiguo. k-NN Density Dominator Component Delegations Based Density Peaks Clustering[J]. Computer Engineering and Applications, 2023, 59(24): 78-87.

吕鸿章, 杨易扬, 杨戈平, 巩志国. k近邻密度支配域代表团密度峰值聚类算法[J]. 计算机工程与应用, 2023, 59(24): 78-87.
