Research on access optimization of small files in massive sample data sets

doi:10.3778/j.issn.1002-8331.1806-0350

Abstract

The Hadoop Distributed File System (HDFS) suffers from high memory usage and low read efficiency when storing massive sample data sets, and the distributed database HBase develops access hotspots when the stored file names are highly repetitive and similar. Based on the characteristics and types of sample data sets, this paper proposes an access optimization scheme that improves the writing, reading, adding, deleting, and replacing of small files in such data sets. The scheme determines the size threshold between large and small files from the hardware configuration, merges small files into HDFS with a variable-scale stack algorithm that follows the directory structure of the sample data set, stores the file index in an HBase table using a row-key optimization strategy, and builds a prefetching mechanism on the Ehcache caching framework. Experimental results show that the scheme reduces memory consumption on the master node, improves file read efficiency, and achieves efficient access to small files in massive sample data sets.

Keywords: Hadoop Distributed File System (HDFS), small files, sample data sets, cache prefetching, distributed database, HBase
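
The abstract does not spell out the row-key optimization strategy. One common way to avoid HBase hotspots when file names are highly similar is to prefix each row key with a short hash of the name, so that adjacent names land in different parts of the key space. The sketch below illustrates that general idea only; the function name `salted_row_key`, the use of MD5, and the four-character prefix are assumptions for illustration, not the authors' design.

```python
import hashlib


def salted_row_key(file_path: str, prefix_len: int = 4) -> str:
    """Build a row key by prepending a short hash prefix to the file path.

    Hypothetical helper: similar paths (e.g. img_00001.jpg, img_00002.jpg)
    receive unrelated prefixes, so they are unlikely to sort next to each other.
    """
    digest = hashlib.md5(file_path.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}_{file_path}"


if __name__ == "__main__":
    # Example paths are made up for illustration.
    for i in range(3):
        print(salted_row_key(f"/samples/cats/img_{i:05d}.jpg"))
```

Because the prefix is derived from the path itself, a reader can recompute the row key from the file name at lookup time without a separate mapping table. The trade-off is that range scans over one directory are no longer contiguous.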


MA Zhen, HALIDAN Abudureyimu, LI Xitong. Research on access optimization of small files in massive sample data sets[J]. Computer Engineering and Applications, 2018, 54(22): 80-84.


