Optimization of Distributed K-means Algorithm with RDD-based Extended Index Layer

doi:10.3778/j.issn.1002-8331.1709-0161

Computer Engineering and Applications ›› 2019, Vol. 55 ›› Issue (1): 161-167.DOI: 10.3778/j.issn.1002-8331.1709-0161

Previous Articles Next Articles

Optimization of Distributed K-means Algorithm with RDD-based Extended Index Layer

MA Jing1，2, LI Li3

1.Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
2.Guizhou Provincial Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
3.College of Software, Shanghai Jiao Tong University, Shanghai 200240, China

Online:2019-01-01 Published:2019-01-07

RDD上扩展索引层优化的分布式K-means算法

马菁1，2，李力3

1.上海交通大学计算机科学与工程系，上海 200240
2.贵州大学贵州省公共大数据重点实验室，贵阳 550025
3.上海交通大学软件学院，上海 200240

Abstract

Abstract: K-means is a classical clustering algorithm. Distributed computing is utilized to improve its scalability on large-scale data in recent years. However, traditional disk-based distributed systems still have a large amount of I/O consumption. This paper extends the machine learning library MLlib in Spark, a memory-based distributed platform with low I/O consumption and good fault tolerance mechanism. This paper develops a RDD-based index layer, which exploits a two-level index comp to rising multiple index strategies, including R-tree, quad tree, KD-tree and Ball tree. By overriding the data partitioning methods, this paper optimizes K-means algorithm by preprocessing the points with close spatial distance. The index layer is used to store the general information of the corresponding point set so that the search space pruning can be used in K-means algorithm. The experimental results show that the index layer can prune the search space by more than 40%, and improve the efficiency by 21% compared to the na?ve distributed K-means algorithm.

Key words: K-means, clustering algorithm, Spark, R-tree, quad tree, KD-tree, Ball-tree

摘要： K-means是经典的聚类算法，为了适应大规模数据，很多研究利用分布式计算提高其扩展性。但传统基于磁盘的分布式系统仍然存在大量I/O消耗，在基于内存的Spark系统上实现，在继承Spark平台低读写消耗和良好容错性等优点的基础上，扩展了Spark的机器学习MLlib库，在此之上增加一个索引层，引入包含多种策略的基于RDD的双级索引机制，采用新的数据划分方式，对空间距离相近的点的信息进行预处理，利用索引存储其对应的点集的概括信息，以便在K-means算法中对搜索空间剪枝，从而达到对K-means算法的优化。实验结果表明，索引层能够剪枝搜索空间达40%以上，相对无优化的分布式K-means，提升效率达21%，具有较好的可扩展性。

关键词: K-means算法, 聚类算法, Spark系统, R树, 四叉树, KD树, Ball树

MA Jing1，2, LI Li3. Optimization of Distributed K-means Algorithm with RDD-based Extended Index Layer[J]. Computer Engineering and Applications, 2019, 55(1): 161-167.

马菁1，2，李力3. RDD上扩展索引层优化的分布式K-means算法[J]. 计算机工程与应用, 2019, 55(1): 161-167.

[1]	LI Junli. Parallel Mutual-Information Computation of Categorical Data Based on Spark [J]. Computer Engineering and Applications, 2021, 57(7): 95-100.
[2]	LI Shuo, LIANG Yi. Prediction Model of Execution Time for Batch Application in Spark [J]. Computer Engineering and Applications, 2021, 57(5): 79-87.
[3]	WANG Changlong, ZHANG Yuandong, MIAO Hong, YANG Yuheng. Application of Double Channel Convolutional Neural Network in Pumpkin Diseases Identification [J]. Computer Engineering and Applications, 2021, 57(5): 183-189.
[4]	WANG Junling, LU Xinming. Video Key Frame Extraction Algorithm Based on Semantic Correlation [J]. Computer Engineering and Applications, 2021, 57(4): 192-198.
[5]	WANG Fuyin, ZHANG Desheng, ZHANG Xiao. Adaptive Density Peaks Clustering Algorithm Combining with Whale Optimization Algorithm [J]. Computer Engineering and Applications, 2021, 57(3): 94-102.
[6]	ZHANG Ziran, HUANG Weihua, CHEN Yang, ZHANG Zheng, LI Ziyuan. Improved Ant Colony Path Planning Algorithm Based on Bidirectional Search [J]. Computer Engineering and Applications, 2021, 57(21): 270-277.
[7]	DING Songyang, TIAN Qingyun. Density Peak Clustering Algorithm Based on Ball-Tree [J]. Computer Engineering and Applications, 2021, 57(20): 90-96.
[8]	ZHANG Kehua, CHENG Weiqing. Differential Privacy Clustering Algorithm Based on Spatial Dynamic Partition [J]. Computer Engineering and Applications, 2021, 57(2): 97-103.
[9]	WENG Yushang, XIAO Jinqiu, XIA Yu. Strip Surface Defect Detection Based on Improved Mask R-CNN Algorithm [J]. Computer Engineering and Applications, 2021, 57(19): 235-242.
[10]	CHENG Jingyi, DUAN Xianhua, ZHU Wei. Research on Metal Surface Defect Detection by Improved YOLOv3 [J]. Computer Engineering and Applications, 2021, 57(19): 252-258.
[11]	BAI Lu, ZHAO Xin, KONG Yuting, ZHANG Zhenghang, SHAO Jinxin, QIAN Yurong. Survey of Spectral Clustering Algorithms [J]. Computer Engineering and Applications, 2021, 57(14): 15-26.
[12]	XIANG Yixuan, JIANG He, PAN Pinchen, SUN Conghui. Study on [K]-means Clustering Algorithm of Quadratic Power Coupling [J]. Computer Engineering and Applications, 2021, 57(14): 95-102.
[13]	PAN Chengsheng, ZHANG Bin, LYU Yana, DU Xiuli, QIU Shaoming. K-Means Text Clustering Based on Improved Gray Wolf Optimization Algorithm [J]. Computer Engineering and Applications, 2021, 57(1): 188-193.
[14]	GAO Weijun, SHI Yang, YANG Jie, ZHANG Chunxia. An Improved Lightweight Head Detection Method [J]. Computer Engineering and Applications, 2021, 57(1): 207-212.
[15]	LU Junjie, HUANG Jinquan, LU Feng. Likelihood K-means Clustering for Gas Path Failure Diagnostics of Turbofan Engine [J]. Computer Engineering and Applications, 2020, 56(9): 136-141.

Optimization of Distributed K-means Algorithm with RDD-based Extended Index Layer

RDD上扩展索引层优化的分布式K-means算法

PDF

Knowledge

Abstract

Cite this article

share this article

References

Related Articles 15

Recommended Articles

Metrics