Parallel OPTICS by Using Mean Distance and Relevance Marks

doi:10.3778/j.issn.1002-8331.2203-0018

Abstract

This paper proposes POMDRM-MR, a parallel OPTICS algorithm based on MapReduce that uses mean distance and relevance marks. It targets four problems of parallel density-based clustering on big data: unreasonable data partitioning, low clustering accuracy, results that are highly sensitive to parameters, and low parallel efficiency. POMDRM-MR first divides the dataset with a strategy called partition with reduced boundary points based on dimension sparsity (DS-PRBP). Within each partition, the marking and ordering points to identify the cluster structure algorithm (MOPTICS) builds the correlations between data points and core points and marks the iteration number of each data point. For distance measurement, the field mean distance strategy (FMD) computes the field mean distance of each data point and uses it in place of the reachability distance. The output sequence is then sorted a second time and clusters are extracted with the reordering and extracting clusters algorithm (REC), which improves the accuracy and stability of local clustering. When merging global clusters, the boundary density filtering strategy (BD-FLC) computes and selects local clusters of similar density. Finally, the parallel local cluster merging algorithm (MCNT-MR), built on union-type merging of n-ary trees and MapReduce, speeds up the convergence of local clusters and merges them in parallel, which improves the efficiency of global cluster merging. Comparative experiments show that POMDRM-MR achieves better clustering results and better parallel performance on large-scale datasets.
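
The abstract does not give the details of BD-FLC or MCNT-MR, so the following is only a minimal sequential sketch of the general idea behind the merging stage, under stated assumptions: local clusters that share a boundary point and have similar density are joined with a disjoint-set forest (a union of n-ary trees). The names `DisjointSet`, `merge_local_clusters`, the `density` values, and the relative `tolerance` test are all hypothetical illustrations, not the authors' definitions, and the MapReduce distribution of the work is omitted.

```python
from collections import defaultdict


class DisjointSet:
    """Union-find forest; each set is a tree whose nodes may have many children."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def add(self, item):
        if item not in self.parent:
            self.parent[item] = item
            self.size[item] = 1

    def find(self, item):
        # Walk up to the root of the tree that contains item.
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[item] != root:
            next_item = self.parent[item]
            self.parent[item] = root
            item = next_item
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        # Union by size: attach the smaller tree under the larger one.
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]


def similar_density(a, b, tolerance):
    # Assumed filter: relative difference of the two densities is small.
    return abs(a - b) <= tolerance * max(a, b)


def merge_local_clusters(local_clusters, density, tolerance=0.2):
    """Merge local clusters that share a point and have similar density.

    local_clusters: dict mapping cluster id -> set of point ids
    density: dict mapping cluster id -> density value (assumed to be given)
    Returns the merged clusters as a sorted list of sorted point lists.
    """
    forest = DisjointSet()
    owners = defaultdict(list)  # point id -> ids of local clusters containing it
    for cluster_id, points in local_clusters.items():
        forest.add(cluster_id)
        for point in points:
            owners[point].append(cluster_id)

    # Only points owned by more than one local cluster can trigger a merge.
    for cluster_ids in owners.values():
        for i, first in enumerate(cluster_ids):
            for second in cluster_ids[i + 1:]:
                if similar_density(density[first], density[second], tolerance):
                    forest.union(first, second)

    merged = defaultdict(set)
    for cluster_id, points in local_clusters.items():
        merged[forest.find(cluster_id)] |= points
    return sorted(sorted(points) for points in merged.values())
```

In this sketch, two local clusters that overlap but differ strongly in density stay separate, which is the role the abstract assigns to the density filter; in a MapReduce setting the pairwise checks could be distributed by shared point, but that mapping is not described in the source.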

Key words: big data, density-based clustering, MapReduce, OPTICS, partition with reduced boundary points (PRBP)

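Abstract (Chinese version, translated): To address the problems of traditional parallel density-based clustering algorithms in big data environments, namely unreasonable data partitioning, low accuracy of clustering results, results that depend heavily on parameters, and low parallel efficiency, this paper proposes POMDRM-MR, a parallel OPTICS algorithm under MapReduce that uses mean distance and relevance marks. The algorithm partitions the dataset with a reduced-boundary-point partitioning strategy based on dimension sparsity (DS-PRBP). For each partition, it proposes the marking and ordering points to identify clusters algorithm (MOPTICS), which builds the correlations between data points and core points and marks the iteration number of each data point. In the distance measurement, the field mean distance strategy (FMD) computes the field mean distance of each data point and uses it in place of the reachability distance for ordering, and the relevance-marked sequence is output. The reordering and extracting clusters algorithm (REC) then sorts the output sequence a second time and extracts clusters, improving the accuracy and stability of local clustering. When merging global clusters, the boundary density filtering strategy (BD-FLC) computes and selects local clusters of similar density. Based on union-type merging of n-ary trees and the MapReduce model, the parallel local cluster merging algorithm (MCNT-MR) accelerates the convergence of local clusters, merges them in parallel, and improves the efficiency of global cluster merging. Comparative experiments show that POMDRM-MR produces better clustering results and has better parallel performance on large-scale datasets.

Key words (Chinese version, translated): big data, density-based clustering, MapReduce, OPTICS, PRBP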

ZHENG Jian, YU Xin. Parallel OPTICS by Using Mean Distance and Relevance Marks[J]. Computer Engineering and Applications, 2023, 59(5): 232-244.

郑剑, 余鑫. Parallel OPTICS algorithm using mean distance and relevance marks[J]. 计算机工程与应用 (Computer Engineering and Applications), 2023, 59(5): 232-244.
