Highly efficient parallel algorithm of [K]-Medoids based on Hadoop platform

Abstract

Abstract: The traditional [K]-Medoids algorithm is sensitive to the choice of initial cluster centers and converges slowly, and in big-data environments it runs into memory-capacity and CPU-speed bottlenecks. To address these problems, this paper improves the initial center selection scheme and the center replacement strategy, and combines the Hadoop distributed computing platform with a Top [K]-based parallel random sampling strategy to obtain an efficient and stable parallel [K]-Medoids algorithm. Tuning the Hadoop platform optimizes the algorithm further. Experiments show that the improved [K]-Medoids algorithm achieves good speedup and also improves convergence and clustering accuracy.

Key words: [K]-Medoids, distributed computation, Hadoop, parallel sampling
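
The abstract names a Top [K]-based parallel random sampling strategy but does not spell it out. The sketch below shows one common way such a scheme can work, and it is an assumption rather than the authors' implementation: each partition tags its records with random keys and keeps only its K largest keys (the mapper-side step), and a final merge keeps the K largest keys overall (the reducer-side step), which yields a uniform sample without replacement. The function names `local_top_k`, `merge_top_k`, and `parallel_sample` are hypothetical, and plain Python lists stand in for Hadoop input splits.

```python
import heapq
import random
from operator import itemgetter


def local_top_k(records, k, rng):
    """Mapper-side step (assumed): tag each record with a random key
    and keep only the k records with the largest keys."""
    tagged = ((rng.random(), rec) for rec in records)
    return heapq.nlargest(k, tagged, key=itemgetter(0))


def merge_top_k(partials, k):
    """Reducer-side step (assumed): merge the per-partition candidates
    and keep the k records with the largest keys overall."""
    candidates = (item for part in partials for item in part)
    merged = heapq.nlargest(k, candidates, key=itemgetter(0))
    return [rec for _, rec in merged]


def parallel_sample(partitions, k, seed=0):
    """Simulate the two phases sequentially; on a cluster each call to
    local_top_k would run in its own map task."""
    rng = random.Random(seed)
    partials = [local_top_k(part, k, rng) for part in partitions]
    return merge_top_k(partials, k)


if __name__ == "__main__":
    # Three uneven partitions standing in for input splits.
    partitions = [list(range(0, 100)), list(range(100, 250)), list(range(250, 300))]
    print(parallel_sample(partitions, 10))
```

Because every record receives an independent random key, keeping the K largest keys per partition never discards a record that belongs in the global top K, so each partition sends at most K candidates to the merge step regardless of its size.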

WANG Yonggui, DAI Wei, WU Chao. Highly efficient parallel algorithm of [K]-Medoids based on Hadoop platform[J]. Computer Engineering and Applications, 2015, 51(16): 47-54.
