Research and parallelization of DBSCAN algorithm

doi:10.3778/j.issn.1002-8331.1808-0423

Abstract

Abstract: DBSCAN is a well-established density-based clustering algorithm that can cluster data of arbitrary shape and identify noise points. To reduce manual intervention in setting its input parameters, the neighborhood radius Eps and the minimum number of points (MinPts), a new method for calculating the Eps parameter is proposed. In addition, to address the performance limitations of the traditional single-machine DBSCAN algorithm on large data sets, the algorithm is parallelized on the Spark framework. Experimental results show that the improved DBSCAN algorithm achieves high accuracy and stability, and that the parallel implementation performs well in parallel and is suitable for clustering massive data sets.

Keywords: big data, DBSCAN, Apache Spark, distributed computing
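
For readers unfamiliar with the roles of Eps and MinPts, the following is a minimal sketch of the standard single-machine DBSCAN procedure in plain Python. It is not the improved Eps calculation or the Spark implementation described in the article, neither of which is detailed here. The function names `region_query` and `dbscan` and the sample points are assumptions chosen for illustration.

```python
import math

NOISE = -1        # label for points that belong to no cluster
UNVISITED = None  # label for points not yet processed


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (i included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN: one label per point, 0, 1, ... for clusters and -1 for noise."""
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1           # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


# Hypothetical data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Each `region_query` call scans every point, so this sketch costs quadratic time in the number of points. That cost on a single machine is the kind of bottleneck that motivates a distributed implementation.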


SONG Dongfei, XU Hua. Research and parallelization of DBSCAN algorithm[J]. Computer Engineering and Applications, 2018, 54(24): 52-56.


