Computer Engineering and Applications ›› 2018, Vol. 54 ›› Issue (24): 52-56. DOI: 10.3778/j.issn.1002-8331.1808-0423

• Big Data and Cloud Computing •

Research and parallelization of DBSCAN algorithm

SONG Dongfei, XU Hua

  1. School of Internet of Things Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online: 2018-12-15 Published: 2018-12-14

Abstract: DBSCAN is a density-based clustering algorithm that can cluster data of arbitrary shape and identify noise points. To reduce manual intervention in setting its input parameters, the neighborhood radius (Eps) and the minimum number of points (MinPts), a new method for calculating the Eps parameter is proposed. In addition, to address the performance problems of the traditional single-machine DBSCAN algorithm in big data environments, the algorithm is parallelized on the Spark framework. Experimental results show that the improved DBSCAN algorithm achieves high accuracy and stability, and that the parallel implementation has good parallel performance and is suitable for clustering massive data sets.

Key words: big data, DBSCAN, Apache Spark, distributed computing
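
For readers unfamiliar with the roles of Eps and MinPts, the following is a minimal single-machine sketch of the textbook DBSCAN procedure in Python. It is not the authors' improved algorithm: the Eps calculation method and the Spark parallelization described in the abstract are not reproduced here, and the function names `region_query` and `dbscan` are illustrative assumptions.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN: returns one label per point (0, 1, ... or NOISE)."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core point: start a new cluster and expand it
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Two dense 2x2 squares and one isolated point (made-up illustrative data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The brute-force neighborhood search makes this sketch quadratic in the number of points, which illustrates the kind of single-machine cost that a parallel implementation aims to reduce.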