Computer Engineering and Applications, 2018, Vol. 54, Issue (22): 80-84. DOI: 10.3778/j.issn.1002-8331.1806-0350

• Big Data and Cloud Computing •

Research on access optimization of small files in massive sample data sets

MA Zhen, HALIDAN Abudureyimu, LI Xitong

  1. School of Electrical Engineering, Xinjiang University, Urumchi 830047, China
  • Online: 2018-11-15 Published: 2018-11-13

Abstract: The Hadoop Distributed File System (HDFS) suffers from high memory consumption and low read efficiency when storing massive sample data sets, and the distributed database HBase develops access hotspots when stored file names are highly repetitive and similar. Based on the characteristics and types of sample data sets, this paper proposes an access optimization scheme for sample data sets that optimizes the write, read, add, delete, and replace strategies for small files. The scheme measures the dividing point between large and small files according to the hardware configuration; merges small files and stores them in HDFS using a variable-scale stack algorithm that follows the directory structure of the sample data set; stores the file index in an HBase table using a row-key optimization strategy; and builds a prefetching mechanism on the Ehcache caching framework. Experimental results show that the scheme reduces memory consumption on the master node, improves file read efficiency, and achieves efficient access to small files in massive sample data sets.

Key words: Hadoop Distributed File System (HDFS), small file, sample data set, cache prefetch, distributed database, HBase
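
The abstract does not specify the variable-scale stack algorithm or the row-key strategy, so the following Python sketch is only a generic illustration of the two ideas it names, not the authors' method: small files are concatenated into one merged blob with an (offset, length) index, and each index key carries a hash-derived prefix so that similar file names do not cluster together. All names here (`SMALL_FILE_LIMIT`, `salted_row_key`, `pack_small_files`, `read_small_file`) and the 1 MiB threshold are hypothetical; an in-memory byte string stands in for the merged HDFS file and a plain dictionary stands in for the HBase index table.

```python
import hashlib
import io

# Hypothetical threshold. The paper measures the large/small dividing point
# from the hardware configuration; 1 MiB is only a placeholder.
SMALL_FILE_LIMIT = 1 * 1024 * 1024


def salted_row_key(path, buckets=16):
    """Prefix the path with a hash-derived bucket number.

    Similar paths then sort into different key ranges, which is one common
    way to avoid hotspots in a store ordered by key (assumed strategy).
    """
    digest = hashlib.md5(path.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}|{path}"


def pack_small_files(files):
    """Merge small files into one blob and index them by (offset, length).

    `files` maps a path to its bytes. Returns (blob, index, large), where
    `large` lists the paths at or above the threshold, left unmerged.
    """
    blob = io.BytesIO()
    index = {}
    large = []
    # Sorting keeps files from the same directory adjacent in the blob.
    for path in sorted(files):
        data = files[path]
        if len(data) >= SMALL_FILE_LIMIT:
            large.append(path)
            continue
        index[salted_row_key(path)] = (blob.tell(), len(data))
        blob.write(data)
    return blob.getvalue(), index, large


def read_small_file(blob, index, path):
    """Look up a file's offset and length, then slice it out of the blob."""
    offset, length = index[salted_row_key(path)]
    return blob[offset:offset + length]
```

Under these assumptions, the master node tracks one merged file instead of one entry per small file, and a read costs one index lookup plus one ranged read.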