Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (5): 232-244.DOI: 10.3778/j.issn.1002-8331.2203-0018

• Big Data and Cloud Computing • Previous Articles     Next Articles

Parallel OPTICS by Using Mean Distance and Relevance Marks

ZHENG Jian, YU Xin   

  1. School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China
  • Online:2023-03-01 Published:2023-03-01

使用均值距离与关联性标记的并行OPTICS算法

郑剑,余鑫   

  1. 江西理工大学 信息工程学院,江西 赣州 341000

Abstract: The main target of this paper is to design a parallel optics algorithm by using mean distance and relevance marks based on MapReduce, noted as POMDRM-MR, to deal with the problems of unreasonable data division, low accuracy of clustering results, the results are greatly affected by parameters and low efficiency of parallelization in parallel density-based clustering algorithm in big data. In POMDRM-MR, an approach called partition with reduced boundary points based on dimension sparsity(DS-PRBP) is proposed to divide the dataset. For each partition, the algorithm called marking and ordering points to identify the cluster structure(MOPTICS) is proposed to construct the correlations between data points and core points and mark the number of iterations, the field mean distance strategy(FMD) is proposed to calculate the field mean distance of data points instead of the reachable distance in measuring distance. After outputting sequence, combined with reordering and extracting clusters algorithm(REC), the sequence is sorted twice which improves the accuracy and stability. In merging global clusters, an approach called using boundary density to filter local cluster(BD-FLC) is used to calculate and filter local clusters with similar density. And based on the union-type merging of n-ary trees and MapReduce, the parallel local cluster merging algorithm(MCNT-MR) is proposed to get the clustering results faster and merge local clusters in parallel which improves efficiency of merging local clusters. The experiments show that POMDRM-MR algorithm has better effect, and better parallelization performance on large-scale datasets.

Key words: big data, density-based clustering, MapReduce, OPTICS, partition with reduce boundary points(PRBP)

摘要: 针对大数据环境下传统并行密度聚类算法中存在的数据划分不合理,聚类结果准确度不高,结果受参数影响较大以及并行效率低等问题,提出一种MapReduce下使用均值距离与关联性标记的并行OPTICS算法——POMDRM-MR。算法使用一种基于维度稀疏度的减少边界点划分策略(DS-PRBP),划分数据集;针对各个分区,提出标记点排序识别簇算法(MOPTICS),构建数据点与核心点之间的关联性,并标记数据点迭代次数,在距离度量中,使用领域均值距离策略(FMD),计算数据点的领域均值距离,代替可达距离排序,输出关联性标记序列;最后结合重排序序列提取簇算法(REC),对输出序列进行二次排序并提取簇,提高算法局部聚类的准确性和稳定性;在合并全局簇时,算法提出边界密度筛选策略(BD-FLC),计算筛选密度相近局部簇;又基于[n]叉树的并集型合并与MapReduce模型,提出并行局部簇合并算法(MCNT-MR),加快局部簇收敛,并行合并局部簇,提升全局簇合并效率。对照实验表明,POMDRM-MR算法聚类效果更佳,且在大规模数据集下算法的并行化性能更好。

关键词: 大数据, 密度聚类, MapReduce, OPTICS, PRBP