Density Peak Clustering Algorithm Based on Space Vector Search

doi:10.3778/j.issn.1002-8331.2204-0179

Abstract: The density peak clustering (DPC) algorithm incurs excessive time overhead because it builds a similarity matrix over all pairs of sample points. To address this, a density peak clustering algorithm based on space vector search (VS-DPC) is proposed. First, each data point is mapped to a space vector starting at the origin of an n-dimensional orthogonal coordinate system, and the modulus of the vector and its angle to the positive direction of one common coordinate axis are calculated. Second, the truncation distance and the truncation mapping angle are used to determine a similarity range, within which similar vectors are searched. Finally, the similar vectors are used to determine the effective density points, from which a sparse similarity matrix is constructed, reducing the time complexity. Experimental results on seven real datasets from the UCI database and seven artificial datasets with complex shapes show that VS-DPC maintains the clustering accuracy of DPC while reducing the time overhead by about 60%. Compared with CDPC and GDPC, two algorithms that improve the efficiency of DPC, VS-DPC has fewer parameters and improves clustering accuracy and running time by 6 and 18 percentage points on average, respectively.

Keywords: density peak clustering, sparse matrix, time complexity, space vector search, clustering
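
The three steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of the first coordinate axis as the reference axis, and the candidate rule (norm difference at most `d_c` and angle difference at most `theta_c`) are assumptions made for illustration, since the abstract does not define the truncation mapping angle precisely. Exact distances are computed only for the candidates that pass both filters, so the result is a sparse set of pairs rather than a full n-by-n matrix.

```python
import math
from bisect import bisect_left, bisect_right


def norm_and_axis_angle(point, axis=0):
    """Return (norm, angle) of the vector from the origin to `point`.

    The angle is measured against the positive direction of coordinate
    `axis` (an assumed choice of reference axis).
    """
    norm = math.sqrt(sum(x * x for x in point))
    if norm == 0.0:
        return 0.0, 0.0  # angle is undefined for the zero vector; 0 by convention
    cos_theta = max(-1.0, min(1.0, point[axis] / norm))
    return norm, math.acos(cos_theta)


def sparse_similarity(points, d_c, theta_c, axis=0):
    """Return {(i, j): distance} for pairs i < j with distance <= d_c.

    Candidates are first filtered by norm (window of width d_c around the
    norm of point i) and by angle (difference <= theta_c, an assumed rule).
    """
    feats = [norm_and_axis_angle(p, axis) for p in points]
    order = sorted(range(len(points)), key=lambda i: feats[i][0])
    sorted_norms = [feats[i][0] for i in order]
    sparse = {}
    for i, p in enumerate(points):
        norm_i, theta_i = feats[i]
        # By the reverse triangle inequality, any point within d_c of p
        # has a norm inside [norm_i - d_c, norm_i + d_c].
        lo = bisect_left(sorted_norms, norm_i - d_c)
        hi = bisect_right(sorted_norms, norm_i + d_c)
        for j in order[lo:hi]:
            if j <= i:
                continue  # store each unordered pair once
            if abs(feats[j][1] - theta_i) > theta_c:
                continue  # rejected by the angle filter, no distance computed
            dist = math.dist(p, points[j])
            if dist <= d_c:
                sparse[(i, j)] = dist
    return sparse


def local_density(n, sparse):
    """Cutoff-kernel density: number of stored neighbours within d_c."""
    rho = [0] * n
    for i, j in sparse:
        rho[i] += 1
        rho[j] += 1
    return rho


points = [(1.0, 0.0), (1.1, 0.0), (0.0, 1.0), (5.0, 5.0)]
pairs = sparse_similarity(points, d_c=0.5, theta_c=0.2)
rho = local_density(len(points), pairs)
```

The norm window is an exact necessary condition, so it never discards a true neighbour. The angle filter in this sketch is a heuristic: with a small `theta_c` it can discard true neighbours, particularly for points near the origin, and setting `theta_c = math.pi` disables it.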

MA Zhenming, AN Junxiu. Density Peak Clustering Algorithm Based on Space Vector Search[J]. Computer Engineering and Applications, 2023, 59(15): 123-131.
