计算机工程与应用 ›› 2022, Vol. 58 ›› Issue (2): 78-85.DOI: 10.3778/j.issn.1002-8331.2012-0476

• 理论与研发 • 上一篇    下一篇

多密度自适应确定DBSCAN算法参数的算法研究

万佳,胡大裟,蒋玉明   

  1. 1.四川大学 计算机学院,成都 610065
    2.四川省大数据分析与融合应用技术工程实验室,成都 610065
  • 出版日期:2022-01-15 发布日期:2022-01-18

Research on Method of Multi-density Self-Adaptive Determination of DBSCAN Algorithm Parameters

WAN Jia, HU Dasha, JIANG Yuming   

  1. 1.College of Computer Science, Sichuan University, Chengdu 610065, China
    2.Big Data Analysis and Fusion Application Technology Engineering Laboratory of Sichuan Province, Chengdu 610065, China
  • Online:2022-01-15 Published:2022-01-18

摘要: DBSCAN算法的[Eps]和[MinPts]参数需要人为设定,取值不当会导致聚类结果准确度不高,且在密度分布差异大的数据集上,由于参数的全局性,错误地应用于不同密度的簇,导致不能正确地发现簇。针对以上问题,提出一种多密度自适应参数确定算法,利用经过去噪衰减后的数据集的自身分布特性生成候选[Eps]和[MinPts]参数列表,并在簇数趋于稳定的区间内根据去噪级别选取对应的[Eps]和[MinPts]作为初始密度阈值。对在该密度阈值条件下聚类产生的噪声数据使用同样的方法生成候选参数列表,选取最优参数,得到新密度阈值,循环该步骤直到噪声数据的数量或密度阈值低于一定程度为止。将不同密度阈值下的聚类结果进行合并。实验结果表明,该算法能够自适应地选取合适的多密度阈值,并在密度分布差异大的数据集上有很好的聚类效果。

关键词: DBSCAN算法, 去噪衰减, 多密度阈值, 自适应

Abstract: The parameters of [Eps] and [MinPts] in DBSCAN algorithm need to be set artificially. Improper values will result in low accuracy of clustering results. Moreover, due to the global nature of the parameters, they are mistakenly applied to clusters with different densities, causing the cluster not to be found correctly. To solve above problems, this paper puts forward a kind of multi-density adaptive parameter determination algorithm, the algorithm first generates candidate [Eps] and [MinPts] parameter lists by using the self-distribution characteristics of the denoising attenuated data set, and in the interval where the number of clusters tends to be stable, corresponding [Eps] and [MinPts] are selected as the initial density threshold according to the denoising level. Then, the candidate parameter list is generated by using the same method for noise data generated by clustering under the condition of the density threshold, and the optimal parameter is selected to obtain the new density threshold. The step is cycled until the noise data quantity or density threshold is lower than a certain degree. Finally, the clustering results of different density thresholds are combined to obtain the final clustering result. The experimental results show that the algorithm can adaptively select the appropriate multi-density threshold and has a good clustering effect on the data sets with large density distribution differences.

Key words: DBSCAN algorithm, denoising attenuation, multi-density threshold, self-adaptive