Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (22): 95-98.

• Research and Development, Design, Testing •

Weblog mining based on MapReduce

LI Bin, LIU Lili

  1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China
  • Online: 2012-08-01 Published: 2012-08-06

Abstract: Web data mining systems that run on a single CPU node hit a computational bottleneck when mining massive Web data sources. To address this, the paper combines the distributed processing and virtualization strengths of cloud computing with the inherent parallelism of the ant colony algorithm, and designs a Web log mining algorithm based on the Map/Reduce architecture. To verify the algorithm's efficiency, it is run on a Hadoop platform to mine users' preferred access paths from Web logs. Experimental results show that using the distributed computing capacity of the cluster to process large volumes of Web log files can greatly improve the efficiency of Web data mining.

Key words: cloud computing, Map/Reduce, Hadoop platform, Web log mining, ant colony algorithm
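
The abstract describes mining users' preferred access paths with a Map/Reduce job but gives no algorithmic detail. The following is only a minimal illustrative sketch of the general map/reduce pattern for this task, not the authors' algorithm: it runs in a single Python process rather than on Hadoop, it omits the ant colony component entirely, and the record format `(session_id, timestamp, page)`, the sample data, and the function names `map_phase`, `reduce_phase` and `preferred_path` are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical, pre-parsed log records: (session_id, timestamp, page).
LOG = [
    ("s1", 1, "/home"), ("s1", 2, "/list"), ("s1", 3, "/item"),
    ("s2", 1, "/home"), ("s2", 2, "/list"), ("s2", 3, "/cart"),
    ("s3", 1, "/home"), ("s3", 2, "/list"), ("s3", 3, "/item"),
]

def map_phase(records):
    """Emit ((from_page, to_page), 1) for each consecutive click in a session."""
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    for _, clicks in groupby(ordered, key=lambda r: r[0]):
        pages = [page for _, _, page in clicks]
        for src, dst in zip(pages, pages[1:]):
            yield (src, dst), 1

def reduce_phase(pairs):
    """Sum the counts emitted for each (from_page, to_page) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def preferred_path(totals, start, max_hops=10):
    """Greedily follow the most frequent outgoing transition from `start`.

    Assumption: a greedy walk stands in for the paper's ant colony search.
    """
    path = [start]
    while len(path) <= max_hops:
        candidates = {dst: n for (src, dst), n in totals.items()
                      if src == path[-1] and dst not in path}
        if not candidates:
            break
        path.append(max(candidates, key=candidates.get))
    return path

totals = reduce_phase(map_phase(LOG))
print(preferred_path(totals, "/home"))  # ['/home', '/list', '/item']
```

In a real Hadoop deployment the map and reduce functions would run on separate nodes over partitions of the log files, which is where the speed-up reported in the abstract would come from. Here the two phases are simply chained in memory to show the data flow.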