Computer Engineering and Applications, 2016, Vol. 52, Issue (9): 44-49.

Performance analysis of four methods for handling small files in Hadoop

LI Sanmiao (李三淼), LI Longshu (李龙澍)

  1. School of Computer Science and Technology, Anhui University, Hefei 230601, China
  • Online: 2016-05-01 Published: 2016-05-16

Abstract: Hadoop was designed to store and analyze big data, and it performs best on large data sets. In practical applications, however, large numbers of small files are common. Four methods are generally used to handle massive numbers of small files: the default input format TextInputFormat; CombineFileInputFormat, an input format designed for small files; SequenceFile; and Harballing (Hadoop Archives, HAR). To compare how these four techniques perform on large numbers of small files in the same Hadoop distributed environment, this paper runs a word frequency counting program on typical data sets and measures the performance differences among them. The experiments show that, when large numbers of small files must be processed under different requirements, choosing an appropriate handling method can substantially improve processing efficiency.

Keywords: Hadoop, small files handling, Hadoop Distributed File System (HDFS), MapReduce, big data
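
As a rough illustration of why the choice of input format matters for small files, the sketch below counts input splits (and hence map tasks) under two simplified models. It is not the paper's experiment: the function names, file sizes, and file counts are hypothetical. It assumes that a per-file input format such as TextInputFormat produces at least one split per file, while a combining format packs files greedily into splits up to a maximum size, and it ignores node and rack locality.

```python
from typing import List


def splits_per_file(file_sizes: List[int], block_size: int) -> int:
    """Simplified per-file model: every file is split on its own,
    so each file contributes at least one split (one map task)."""
    # -(-a // b) is integer ceiling division.
    return sum(max(1, -(-size // block_size)) for size in file_sizes)


def splits_combined(file_sizes: List[int], max_split_size: int) -> int:
    """Simplified combining model: files are packed greedily, in order,
    into one split until adding the next file would exceed max_split_size.
    Assumes every file is smaller than max_split_size."""
    splits = 0
    current = 0  # bytes accumulated in the split being built
    for size in file_sizes:
        if current > 0 and current + size > max_split_size:
            splits += 1
            current = 0
        current += size
    if current > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    # Hypothetical workload: 10,000 files of 100 KB each, 128 MB limit.
    sizes = [100 * 1024] * 10_000
    limit = 128 * 1024 * 1024
    print("per-file splits:", splits_per_file(sizes, limit))
    print("combined splits:", splits_combined(sizes, limit))
```

Under these assumptions the per-file model schedules one map task per small file, whereas the combining model needs only a handful, which is the kind of overhead difference the comparison in the abstract is concerned with.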
