Computer Engineering and Applications ›› 2013, Vol. 49 ›› Issue (3): 57-60.


Performance analysis of Hadoop for handling small files in single node

YUAN Yu1,2, CUI Chaoyuan2, WU Yun2, CHEN Zhuhong2,3

  1. College of Computing & Communication Engineering, Graduate University of the Chinese Academy of Sciences, Beijing 100049, China
    2.Institute of Intelligent Machine, Chinese Academy of Sciences, Hefei 230031, China
    3. School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China
  • Online: 2013-02-01    Published: 2013-02-18


Abstract: Hadoop is a software framework that supports distributed processing of large data sets; it works well with large files, but whether it also handles small files well is open to question. Taking word frequency statistics as an example, experiments on several typical file sets in a single node compare Hadoop's performance on small files under different FileInputFormat implementations, and the performance differences are explained in terms of Hadoop's execution model. The analysis shows that packing many small files into one split improves Hadoop's performance on small files.
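The core idea behind the result, packing many small files into a single input split so that fewer map tasks are launched, can be sketched with a small simulation. This is illustrative Python, not Hadoop code; the greedy packing loop and the 1 MB split limit are simplifying assumptions standing in for what Hadoop's CombineFileInputFormat does.

```python
def pack_into_splits(file_sizes, max_split_bytes):
    """Greedily pack files into combined splits, a simplified
    sketch of CombineFileInputFormat's behavior."""
    splits, current, current_size = [], [], 0
    for size in file_sizes:
        # Start a new split once the current one would overflow.
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits

# 1000 small files of 4 KB each.
files = [4 * 1024] * 1000

# Default FileInputFormat behavior for small files:
# one split (hence one map task) per file.
default_splits = [[s] for s in files]

# Combined behavior: pack files into 1 MB splits.
combined_splits = pack_into_splits(files, 1024 * 1024)

print(len(default_splits))   # 1000 map tasks
print(len(combined_splits))  # 4 map tasks
```

Since each map task carries fixed startup and scheduling overhead, reducing 1000 tasks to 4 is where the performance gain measured in the paper comes from.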

Key words: Hadoop, Hadoop Distributed File System(HDFS), MapReduce, small files handling, FileInputFormat
