计算机工程与应用 ›› 2023, Vol. 59 ›› Issue (15): 123-131.DOI: 10.3778/j.issn.1002-8331.2204-0179

• 理论与研发 • 上一篇    下一篇

基于空间向量搜索的密度峰值聚类算法

马振明,安俊秀   

  1. 成都信息工程大学 软件工程学院, 成都 610200
  • 出版日期:2023-08-01 发布日期:2023-08-01

Density Peak Clustering Algorithm Based on Space Vector Search

MA Zhenming, AN Junxiu   

  1. College of Software Engineering, Chengdu University of Information Technology, Chengdu 610200, China
  • Online:2023-08-01 Published:2023-08-01

摘要: 针对密度峰值聚类(DPC)算法因构建全局样本点间的相似度矩阵,而导致时间开销过大的问题,提出了一种基于空间向量搜索的密度峰值聚类(VS-DPC)算法。在[n]维正交坐标系中将数据点映射为以原点为起点的空间向量,计算向量的模和与统一坐标轴正方向间的夹角;利用截断距离和截断映射角确定相似范围搜索相似向量;利用相似向量确定有效密度点从而构建稀疏相似度矩阵,降低时间复杂度。在UCI数据库中7个真实数据集和7个形状复杂的人工数据集上的实验结果表明,所提的VS-DPC算法保持了DPC的聚类精度,相较DPC算法减少了约60%的时间开销。并对比于CDPC和GDPC两种提升DPC聚类效率的算法,算法参数更少,且在聚类精度和时间上分别平均提升6和18个百分点。

关键词: 密度峰值聚类, 稀疏矩阵, 时间复杂度, 向量搜索, 聚类

Abstract: A density peak clustering(VS-DPC) algorithm based on spatial vector search is proposed to address the problem of excessive time overhead due to the construction of the similarity matrix between global sample points in the density peak clustering(DPC) algorithm. Firstly, the data points are mapped into a spatial vector starting from the origin in an n-dimensional orthogonal coordinate system, and the modulus of the vector and the angle between it and the positive direction of the unified coordinate axis are calculated. Secondly, the similarity vector is searched using the truncation distance and truncation mapping angle to determine the similarity range. Finally, the similarity vector is used to determine the effective density points thus constructing a sparse similarity matrix to reduce the time complexity. The experimental results on seven real datasets and seven artificial datasets with complex shapes in the UCI database show that the proposed VS-DPC algorithm maintains the clustering accuracy of DPC and reduces the time overhead by about 60% compared to the DPC algorithm. And compared with CDPC and GDPC, two algorithms to improve the efficiency of DPC clustering, the algorithm has fewer parameters and improves the clustering accuracy and time by 6 and 18 percentage points on average, respectively.

Key words: density peak clustering, sparse matrix, time complexity, space vector search, clustering