Computer Engineering and Applications ›› 2013, Vol. 49 ›› Issue (14): 133-137.

• Database, Data Mining, Machine Learning •

Parallel fuzzy C-means algorithm based on MapReduce

YU Qianqian (虞倩倩), DAI Yueming (戴月明)

  1. School of IOT Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online: 2013-07-15 Published: 2013-07-31

Abstract: Fuzzy C-means is an important soft-clustering algorithm, but its time cost becomes too high as the amount of data grows. To address this drawback, a parallel fuzzy C-means algorithm based on MapReduce is proposed. The fuzzy C-means algorithm is redesigned to fit the key/value programming model of MapReduce: the membership degrees of the data points to the cluster centers are computed in parallel, and the new cluster centers are then recalculated, which improves the efficiency of fuzzy C-means on large data sets. The experimental results show that the parallel fuzzy C-means algorithm based on MapReduce achieves high speedup and good scalability.

Key words: fuzzy C-means, parallel computing, MapReduce, data mining, cloud computing
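
The abstract states that memberships are computed in parallel and the cluster centers are then recalculated, but it does not give the key/value layout. The following is a minimal single-process sketch of one iteration under stated assumptions, not the authors' implementation: it uses the standard fuzzy C-means update formulas, a fuzzifier m = 2, and the cluster index as the key. The names `fcm_map`, `fcm_reduce`, and `fcm_iteration` are hypothetical, and the in-memory grouping only stands in for the shuffle phase of a real MapReduce job.

```python
import math
from collections import defaultdict


def fcm_map(point, centers, m=2.0):
    """Map step (assumed layout): for one data point, emit
    (cluster index, (membership**m * point, membership**m))."""
    dists = [math.dist(point, c) for c in centers]
    zeros = sum(1 for d in dists if d == 0.0)
    if zeros:
        # The point coincides with one or more centers: share full membership there.
        memberships = [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
    else:
        exp = 2.0 / (m - 1.0)
        # Standard FCM membership: u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1))
        memberships = [
            1.0 / sum((d_j / d_k) ** exp for d_k in dists) for d_j in dists
        ]
    for j, u in enumerate(memberships):
        w = u ** m
        yield j, ([w * x for x in point], w)


def fcm_reduce(values):
    """Reduce step: sum the weighted points and weights for one cluster
    and return the new center, or None if the total weight is zero."""
    total_w = 0.0
    total_x = None
    for wx, w in values:
        total_w += w
        total_x = wx if total_x is None else [a + b for a, b in zip(total_x, wx)]
    if total_x is None or total_w == 0.0:
        return None
    return [a / total_w for a in total_x]


def fcm_iteration(points, centers, m=2.0):
    """One FCM iteration: map every point, group by key, reduce per cluster."""
    groups = defaultdict(list)
    for p in points:
        for key, value in fcm_map(p, centers, m):
            groups[key].append(value)  # stands in for the shuffle phase
    new_centers = []
    for j, old in enumerate(centers):
        updated = fcm_reduce(groups[j])
        # Keep the old center if no weight reached this cluster.
        new_centers.append(updated if updated is not None else list(old))
    return new_centers


if __name__ == "__main__":
    points = [[0.0], [1.0], [9.0], [10.0]]
    centers = [[0.0], [10.0]]
    for _ in range(5):
        centers = fcm_iteration(points, centers)
    print(centers)
```

In this layout each map call depends only on one point and the current centers, so the points can be split across workers freely; only the per-cluster sums have to be brought together in the reduce step. A full run would repeat `fcm_iteration` until the centers stop moving by more than a chosen tolerance.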