Computer Engineering and Applications ›› 2019, Vol. 55 ›› Issue (1): 161-167.DOI: 10.3778/j.issn.1002-8331.1709-0161

Previous Articles     Next Articles

Optimization of Distributed K-means Algorithm with RDD-based Extended Index Layer

MA Jing1,2, LI Li3   

  1. 1.Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
    2.Guizhou Provincial Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
    3.College of Software, Shanghai Jiao Tong University, Shanghai 200240, China
  • Online:2019-01-01 Published:2019-01-07

RDD上扩展索引层优化的分布式K-means算法

马  菁1,2,李  力3   

  1. 1.上海交通大学 计算机科学与工程系,上海 200240
    2.贵州大学 贵州省公共大数据重点实验室,贵阳 550025
    3.上海交通大学 软件学院,上海 200240

Abstract: K-means is a classical clustering algorithm. Distributed computing is utilized to improve its scalability on large-scale data in recent years. However, traditional disk-based distributed systems still have a large amount of I/O consumption. This paper extends the machine learning library MLlib in Spark, a memory-based distributed platform with low I/O consumption and good fault tolerance mechanism. This paper develops a RDD-based index layer, which exploits a two-level index comp to rising multiple index strategies, including R-tree, quad tree, KD-tree and Ball tree. By overriding the data partitioning methods, this paper optimizes K-means algorithm by preprocessing the points with close spatial distance. The index layer is used to store the general information of the corresponding point set so that the search space pruning can be used in K-means algorithm. The experimental results show that the index layer can prune the search space by more than 40%, and improve the efficiency by 21% compared to the na?ve distributed K-means algorithm.

Key words: K-means, clustering algorithm, Spark, R-tree, quad tree, KD-tree, Ball-tree

摘要: K-means是经典的聚类算法,为了适应大规模数据,很多研究利用分布式计算提高其扩展性。但传统基于磁盘的分布式系统仍然存在大量I/O消耗,在基于内存的Spark系统上实现,在继承Spark平台低读写消耗和良好容错性等优点的基础上,扩展了Spark的机器学习MLlib库,在此之上增加一个索引层,引入包含多种策略的基于RDD的双级索引机制,采用新的数据划分方式,对空间距离相近的点的信息进行预处理,利用索引存储其对应的点集的概括信息,以便在K-means算法中对搜索空间剪枝,从而达到对K-means算法的优化。实验结果表明,索引层能够剪枝搜索空间达40%以上,相对无优化的分布式K-means,提升效率达21%,具有较好的可扩展性。

关键词: K-means算法, 聚类算法, Spark系统, R树, 四叉树, KD树, Ball树