Computer Engineering and Applications ›› 2013, Vol. 49 ›› Issue (3): 57-60.


Performance analysis of Hadoop for handling small files in single node

YUAN Yu1,2, CUI Chaoyuan2, WU Yun2, CHEN Zhuhong2,3

  1. College of Computing & Communication Engineering, Graduate University of the Chinese Academy of Sciences, Beijing 100049, China
    2.Institute of Intelligent Machine, Chinese Academy of Sciences, Hefei 230031, China
    3. School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China
  • Online: 2013-02-01    Published: 2013-02-18


Abstract: Hadoop is a software framework that supports distributed processing of large data sets; it works well with large files, but whether it also handles small files well is open to question. Taking word frequency statistics as an example, experiments on several typical file sets in a single node compare Hadoop's performance on small files under different FileInputFormat implementations, and the performance differences are explained in terms of Hadoop's execution model. The analysis shows that packing many small files into one split improves Hadoop's performance on small files.
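The core idea behind the result, packing many small files into a single input split so that fewer map tasks are launched, can be sketched with a small simulation. This is illustrative Python, not Hadoop code; the greedy packing loop and the 1 MB split limit are simplifying assumptions standing in for what Hadoop's CombineFileInputFormat does.

```python
def pack_into_splits(file_sizes, max_split_bytes):
    """Greedily pack files into combined splits, a simplified
    sketch of CombineFileInputFormat's behavior."""
    splits, current, current_size = [], [], 0
    for size in file_sizes:
        # Start a new split once the current one would overflow.
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits

# 1000 small files of 4 KB each.
files = [4 * 1024] * 1000

# Default FileInputFormat behavior for small files:
# one split (hence one map task) per file.
default_splits = [[s] for s in files]

# Combined behavior: pack files into 1 MB splits.
combined_splits = pack_into_splits(files, 1024 * 1024)

print(len(default_splits))   # 1000 map tasks
print(len(combined_splits))  # 4 map tasks
```

Since each map task carries fixed startup and scheduling overhead, reducing 1000 tasks to 4 is where the performance gain measured in the paper comes from.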

Key words: Hadoop, Hadoop Distributed File System(HDFS), MapReduce, small files handling, FileInputFormat
