计算机工程与应用 ›› 2022, Vol. 58 ›› Issue (14): 73-79.DOI: 10.3778/j.issn.1002-8331.2110-0312

• 理论与研发 • 上一篇    下一篇

融合网格划分和DBSCAN的改进聚类算法

孙璐,梁永全   

  1. 山东科技大学 计算机科学与工程学院,山东 青岛 266590
  • 出版日期:2022-07-15 发布日期:2022-07-15

Improved Clustering Algorithm Fusing Grid Partition and DBSCAN

SUN Lu, LIANG Yongquan   

  1. School of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266590, China
  • Online:2022-07-15 Published:2022-07-15

摘要: 针对基于密度的噪声应用空间聚类算法(density based spatial clustering of applications with noise,DBSCAN)计算复杂度较高以及无法聚类多密度数据集等问题,提出了一种网格聚类算法和DBSCAN相结合的融合聚类算法(G_FDBSCAN)。利用网格划分技术将数据集划分为稀疏区域和密集区域,分而治之,降低计算的时间复杂度和采用全局参数引起的聚类误差;改进传统的DBSCAN聚算法得到FDBSCAN,将密集区域中网格聚类的结果作为一个整体参与后续的聚类,在网格划分基础上进行邻域检索,减少邻域检索和类扩展过程中对象的无效查询和重复查询,进一步减少时间开销。理论分析和实验测试表明,改进后的算法与DBSCAN算法、DPC算法、KMEANS算法、BIRCH算法和CBSCAN算法相比,在聚类结果接近或达到最优的情况下,聚类效率分别平均提升了24倍、11倍、2倍、3倍和1倍。

关键词: 密度聚类, 网格聚类, 计算复杂度, 大规模数据集

Abstract: Aiming at the high computational complexity of density based spatial clustering of applications with noise(DBSCAN), as well the inability to cluster multi-density datasets, a fusion clustering algorithm(G_FDBSCAN) combining the grid clustering algorithm and DBSCAN is proposed. The new algorithm introduces grid division to divide the dataset into sparse areas and dense areas for processing respectively, so as to reduce the time complexity of calculation and the clustering error caused by global parameters. Then, it improves the traditional DBSCAN clustering algorithm to obtain FDBSCAN to take the results of grid clustering in dense areas as a whole to participate in the subsequent clustering, and carries out neighborhood retrieval on the basis of grid division, so as to reduce the invalid query and repeated query of objects in the process of neighborhood retrieval and class expansion, which further reduces the time overhead. Theoretical analysis and experimental tests show that compared with DBSCAN algorithm, DPC algorithm, KMEANS algorithm, BIRCH algorithm and CBSCAN algorithm, when the clustering results are optimal or close to, the clustering efficiency is increased by 24 times, 11times, 2 times, 3 times and1 time respectively.

Key words: density clustering, grid clustering, computational complexity, large spatial datasets