Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (27): 22-26.


Improved Canopy-Kmeans algorithm based on MapReduce

MAO Dianhui   

  1. School of Computer and Information Engineering, Beijing Technology and Business University, Beijing 100048, China
  • Online:2012-09-21 Published:2012-09-24

基于MapReduce的Canopy-Kmeans改进算法

毛典辉   

  1. 北京工商大学 计算机与信息工程学院,北京 100048

Abstract: To avoid the random selection of Canopies in the Canopy-Kmeans algorithm, this paper introduces an improved algorithm based on the minimum and maximum principle and implements it on the MapReduce framework so that it can process massive data. The algorithm is applied to the clustering of massive Internet news. Experiments show that Canopy selection based on the minimum and maximum principle achieves higher classification accuracy and noise immunity than the random strategy.

Key words: Canopy-Kmeans, MapReduce, distributed clustering
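
The abstract does not spell out how the minimum and maximum principle picks Canopy centers. A minimal sketch follows, assuming it means farthest-first selection: each new center is the point whose minimum distance to the already chosen centers is largest. The function names, the Euclidean metric, and the choice of the first point as the seed are illustrative assumptions, not details taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_canopy_centers(points, k, distance=euclidean):
    """Pick up to k centers by max-min (farthest-first) selection.

    Assumption: the first point seeds the selection; the paper may
    choose its initial center differently.
    """
    if not points or k <= 0:
        return []
    centers = [points[0]]
    # min_dist[i] = distance from points[i] to its nearest chosen center
    min_dist = [distance(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        # Next center: the point with the largest minimum distance
        idx = max(range(len(points)), key=min_dist.__getitem__)
        if min_dist[idx] == 0.0:
            break  # every remaining point coincides with a chosen center
        centers.append(points[idx])
        # Refresh nearest-center distances against the new center only
        for i, p in enumerate(points):
            d = distance(p, points[idx])
            if d < min_dist[i]:
                min_dist[i] = d
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0), (5.0, 8.0)]
    print(select_canopy_centers(data, 3))
```

Compared with drawing centers at random, this rule spreads the centers apart, so two seeds are unlikely to land in the same dense region. That is one plausible reading of why the abstract reports better accuracy than the random strategy.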

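Abstract (Chinese version): To address the randomness of Canopy selection in the distributed Canopy-Kmeans algorithm, the algorithm is improved with the "minimum and maximum principle", which avoids blind Canopy selection. The algorithm is also parallelized on the MapReduce parallel computing framework so that it can make full use of a cluster's computing and storage capacity and thus suit massive-data scenarios. With the clustering of massive Internet news as the application background, the improved algorithm is analyzed experimentally. The results show that, compared with the strategy of picking Canopies at random, the method clearly improves both classification accuracy and noise immunity, and that it shows a considerable performance advantage when processing massive data.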

Key words (Chinese version): Canopy-Kmeans algorithm, MapReduce, distributed clustering
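
The abstract says the algorithm is parallelized with MapReduce but gives no job design. As a rough illustration, one K-means iteration can be phrased as a map step (assign each point to its nearest center) and a reduce step (average the points per center). The sketch below simulates both steps in a single Python process; it does not use Hadoop, it omits the Canopy-based pruning of distance computations, and all function names are hypothetical.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance; enough for nearest-center comparison."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(point, centers):
    """Map step: emit (index of nearest center, (partial_sum, count))."""
    nearest = min(range(len(centers)),
                  key=lambda i: squared_distance(point, centers[i]))
    return nearest, (point, 1)


def reduce_mean(values):
    """Reduce step: combine (partial_sum, count) pairs into a mean."""
    dims = len(values[0][0])
    total = [0.0] * dims
    count = 0
    for partial_sum, n in values:
        for d in range(dims):
            total[d] += partial_sum[d]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centers):
    """Run one map / shuffle / reduce round and return the new centers."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = map_assign(p, centers)
        groups[key].append(value)
    # A center that received no points keeps its previous position
    return [reduce_mean(groups[i]) if i in groups else tuple(centers[i])
            for i in range(len(centers))]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
    print(kmeans_iteration(pts, [(0.0, 0.0), (10.0, 0.0)]))
```

Emitting (partial_sum, count) pairs rather than bare points means a combiner could pre-aggregate values on each mapper before the shuffle, which is a common way to cut network traffic in jobs of this shape.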