Improved Clustering Algorithm Fusing Grid Partition and DBSCAN

doi:10.3778/j.issn.1002-8331.2110-0312

Abstract

To address the high computational complexity of density-based spatial clustering of applications with noise (DBSCAN), as well as its inability to cluster multi-density datasets, a fusion clustering algorithm (G_FDBSCAN) that combines the grid clustering algorithm with DBSCAN is proposed. The new algorithm uses grid division to split the dataset into sparse areas and dense areas, which are processed separately, reducing both the time complexity of the computation and the clustering error caused by global parameters. It then improves the traditional DBSCAN clustering algorithm to obtain FDBSCAN, which treats the grid clustering results in dense areas as a whole in the subsequent clustering and carries out neighborhood retrieval on the basis of the grid division. This reduces invalid and repeated queries of objects during neighborhood retrieval and cluster expansion, further lowering the time overhead. Theoretical analysis and experimental tests show that, when the clustering results are optimal or close to optimal, the improved algorithm raises clustering efficiency by an average of 24 times, 11 times, 2 times, 3 times, and 1 time compared with the DBSCAN, DPC, KMEANS, BIRCH, and CBSCAN algorithms, respectively.

Key words: density clustering, grid clustering, computational complexity, large spatial datasets
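
The grid division and grid-based neighborhood retrieval described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' G_FDBSCAN implementation: the abstract does not state the cell size, the rule for separating dense from sparse areas, or how dense-area results are merged, so `cell_size`, `dense_threshold`, and all function names below are hypothetical choices made only for illustration.

```python
from collections import defaultdict
from itertools import product
from math import ceil, dist, floor


def cell_of(point, cell_size):
    """Return the integer grid coordinates of the cell containing `point`."""
    return tuple(floor(x / cell_size) for x in point)


def build_grid(points, cell_size):
    """Map each occupied cell to the list of indices of the points inside it."""
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[cell_of(p, cell_size)].append(idx)
    return grid


def split_cells(grid, dense_threshold):
    """Split occupied cells into dense and sparse sets by point count.

    Assumption: a cell is "dense" if it holds at least `dense_threshold`
    points. The paper may use a different criterion.
    """
    dense = {c for c, members in grid.items() if len(members) >= dense_threshold}
    sparse = set(grid) - dense
    return dense, sparse


def region_query(points, grid, idx, eps, cell_size):
    """Return indices of points within `eps` of points[idx] (itself included).

    Only cells that eps can reach are scanned, instead of the whole dataset.
    """
    p = points[idx]
    home = cell_of(p, cell_size)
    reach = ceil(eps / cell_size)  # number of cells eps can span per axis
    neighbors = []
    for offset in product(range(-reach, reach + 1), repeat=len(p)):
        cell = tuple(h + o for h, o in zip(home, offset))
        for j in grid.get(cell, ()):  # .get avoids creating empty cells
            if dist(p, points[j]) <= eps:
                neighbors.append(j)
    return neighbors
```

Restricting each query to the cells within reach of `eps` is what avoids comparing a point against objects that cannot be neighbors. A standard DBSCAN expansion loop could call `region_query` in place of a full scan; treating each dense area as a single unit, as FDBSCAN does, is left out of this sketch.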

SUN Lu, LIANG Yongquan. Improved Clustering Algorithm Fusing Grid Partition and DBSCAN[J]. Computer Engineering and Applications, 2022, 58(14): 73-79.
