Computer Engineering and Applications ›› 2013, Vol. 49 ›› Issue (20): 112-117.

• Database, Data Mining, Machine Learning •

Research on massive data mining based on MapReduce

LI Weiwei1, ZHAO Hang2, ZHANG Yang1, WANG Yong3

  1. College of Information Engineering, Northwest A&F University, Yangling, Shaanxi 712100, China
  2. School of Mechano-Electronic Engineering, Xidian University, Xi’an 710072, China
  3. School of Computer, Northwestern Polytechnical University, Xi’an 710072, China
  • Online: 2013-10-15 Published: 2013-10-30

Abstract: MapReduce is a programming model for parallel computation over large-scale data sets. It can run in heterogeneous environments and is simple to program, because developers do not need to deal with the underlying implementation details. In this paper, MapReduce is applied to three data mining algorithms: the Naive Bayes classification algorithm, the K-modes clustering algorithm, and the ECLAT frequent itemset mining algorithm. Experimental results show that, while preserving the accuracy of the algorithms, MapReduce can effectively improve the efficiency of mining massive data.

Key words: cloud computing, data mining, Hadoop, MapReduce
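
The abstract does not describe how the three algorithms are decomposed into map and reduce steps. As an illustration only, the following minimal Python sketch shows one common way the counting phase of Naive Bayes training could be expressed in the MapReduce style. It simulates the map, shuffle, and reduce stages in a single process and does not use Hadoop. The function names (`map_record`, `reduce_counts`, `run_job`) and the toy records are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit (key, 1) pairs for one labeled record.

    Assumption: a record is a tuple of categorical feature values
    followed by its class label.
    """
    *features, label = record
    # One count for the class itself (used for the prior).
    yield ("class", label), 1
    # One count per (label, feature index, feature value) triple
    # (used for the conditional probabilities).
    for index, value in enumerate(features):
        yield ("feature", label, index, value), 1


def reduce_counts(pairs):
    """Shuffle and reduce step: sum the counts that share a key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def run_job(records):
    """Run the map step over all records, then reduce the emitted pairs."""
    pairs = (pair for record in records for pair in map_record(record))
    return reduce_counts(pairs)


# Hypothetical toy data: (outlook, windy, label).
records = [
    ("sunny", "no", "play"),
    ("sunny", "yes", "stay"),
    ("rainy", "yes", "stay"),
    ("sunny", "no", "play"),
]
counts = run_job(records)
```

The resulting `counts` dictionary holds the class frequencies and the per-class feature-value frequencies from which Naive Bayes probabilities can be estimated. In an actual Hadoop job, the map step would run in parallel over input splits and the summation would be performed by reducers after the framework groups pairs by key.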