Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (9): 44-49.

• Big Data and Cloud Computing •

Performance analysis of four methods for handling small files in Hadoop

LI Sanmiao (李三淼), LI Longshu (李龙澍)

  1. School of Computer Science and Technology, Anhui University, Hefei 230601, China
  • Online: 2016-05-01    Published: 2016-05-16

Abstract: Hadoop was designed to store and analyze big data, and it excels at processing large data sets. In practical applications, however, large numbers of small files are common. Four methods are generally used to handle massive numbers of small files: the default input format TextInputFormat, the CombineFileInputFormat input format designed specifically for small files, the SequenceFile technique, and the Harballing (Hadoop Archive, HAR) technique. To compare the performance of these four techniques when processing large numbers of small files in the same Hadoop distributed environment, a word-count program is run on typical data sets and the performance differences among the four techniques are measured. The experiments show that, when processing large numbers of small files under different requirements, choosing the appropriate method can greatly improve processing efficiency.
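The Java sketch below illustrates the kind of word-count driver used for such a comparison; it is not the authors' code, and the class name SmallFileWordCount, the third command-line argument, and the 128 MB split cap are illustrative assumptions. It shows how the same job can be switched between the default TextInputFormat, which produces one input split (and hence one map task) per small file, and CombineTextInputFormat, the ready-made text subclass of CombineFileInputFormat that packs many small files into each split.

```java
// Hedged sketch: a word-count driver whose input format can be switched
// between TextInputFormat and CombineTextInputFormat for comparison.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFileWordCount {

  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // args: <input dir of small files> <output dir> <combine|text>
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "small-file word count");
    job.setJarByClass(SmallFileWordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    if (args.length > 2 && "combine".equals(args[2])) {
      // CombineFileInputFormat packs many small files into one split,
      // so far fewer map tasks are launched; the 128 MB cap is illustrative.
      job.setInputFormatClass(CombineTextInputFormat.class);
      CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024);
    } else {
      // Default behaviour: one split per small file.
      job.setInputFormatClass(TextInputFormat.class);
    }

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Running the same jar twice, once with the "combine" switch and once without, makes the effect of the input format on the number of map tasks directly visible in the job counters.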

Key words: Hadoop, small files handling, Hadoop Distributed File System(HDFS), MapReduce, big data
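For completeness, the following is a minimal sketch of the SequenceFile technique named above, not code from the paper: many small files are packed into a single SequenceFile, with the file name as key and the raw bytes as value, so one large HDFS object replaces thousands of small ones. The class name and command-line paths are placeholders.

```java
// Hedged sketch: pack a directory of small HDFS files into one SequenceFile.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {

  public static void main(String[] args) throws IOException {
    // args: <HDFS dir of small files> <output SequenceFile path>
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);
    Path outputFile = new Path(args[1]);

    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(outputFile),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class));
    try {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (status.isFile()) {
          // Read the whole small file into memory; acceptable for files
          // far smaller than an HDFS block.
          byte[] content = new byte[(int) status.getLen()];
          FSDataInputStream in = fs.open(status.getPath());
          try {
            in.readFully(content);
          } finally {
            in.close();
          }
          // Key = original file name, value = file bytes.
          writer.append(new Text(status.getPath().getName()),
              new BytesWritable(content));
        }
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
```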