Improved Canopy-Kmeans algorithm based on MapReduce

Abstract

To address the random selection of Canopies in the distributed Canopy-Kmeans algorithm, this paper improves the algorithm with the "minimum and maximum principle", which avoids blind Canopy selection. The algorithm is parallelized on the MapReduce framework so that it can make full use of a cluster's computing and storage capacity and thus handle massive data. The improved algorithm is evaluated on the clustering of massive Internet news. The experiments show that, compared with the random Canopy selection strategy, the proposed strategy achieves clearly higher classification accuracy and noise immunity, and shows a considerable performance advantage when processing massive data.

Key words: Canopy-Kmeans algorithm, MapReduce, distributed clustering
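
The abstract does not spell out the selection procedure, so the following is only a minimal sketch of one common reading of a "minimum and maximum" rule: each new Canopy center is the point whose distance to its nearest already-chosen center is largest (farthest-first selection). The function names `euclidean` and `select_canopy_centers`, the random first center, and the fixed number of centers `k` are assumptions made for illustration and are not taken from the paper; the MapReduce parallelization is not shown.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_canopy_centers(points, k, seed=0):
    """Pick k centers so that each new center maximizes its minimum
    distance to the centers already chosen (assumed reading of the
    "minimum and maximum principle"; hypothetical helper)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    rng = random.Random(seed)
    # Assumption: the first center is drawn at random.
    centers = [points[rng.randrange(len(points))]]

    # min_dist[i] = distance from points[i] to its nearest chosen center.
    min_dist = [euclidean(p, centers[0]) for p in points]

    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        idx = max(range(len(points)), key=lambda i: min_dist[i])
        centers.append(points[idx])
        # Refresh nearest-center distances against the new center.
        for i, p in enumerate(points):
            d = euclidean(p, points[idx])
            if d < min_dist[i]:
                min_dist[i] = d

    return centers
```

Under this reading, calling `select_canopy_centers(points, 3)` on data with three well-separated groups tends to place one center in each group, whereas purely random selection may put several centers in the same group.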

MAO Dianhui. Improved Canopy-Kmeans algorithm based on MapReduce[J]. Computer Engineering and Applications, 2012, 48(27): 22-26.

毛典辉. 基于MapReduce的Canopy-Kmeans改进算法[J]. 计算机工程与应用, 2012, 48(27): 22-26.

