Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (17): 257-265. DOI: 10.3778/j.issn.1002-8331.2211-0230

• Big Data and Cloud Computing •

Research on a Metadata Management Method for HDFS Hierarchical Storage Systems

刘晓宇,夏立斌,姜晓巍,孙功星   

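  1. Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049
  2. University of Chinese Academy of Sciences, Beijing 100049
  • Online: 2023-09-01 Published: 2023-09-01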

Research of Metadata Management Method of Hierarchical Storage System Based on HDFS

LIU Xiaoyu, XIA Libin, JIANG Xiaowei, SUN Gongxing   

  1. Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Online: 2023-09-01 Published: 2023-09-01

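Abstract (translated from the Chinese): As high-energy physics experiments continue to grow in scale and complexity, researchers face the challenge of storing massive amounts of data. Given concerns about cost, energy consumption, storage period, and operations and maintenance, tape libraries, which offer large capacity at low cost, have become an indispensable choice for mass storage systems in high-energy physics. However, existing research on HDFS heterogeneous storage does not support tape libraries and cannot meet the need for a cost-effective storage system when persisting and backing up massive experimental data on high-energy physics Hadoop platforms. To address this, and to build an HDFS hierarchical storage system that supports disk-tape storage, integrates tape-tier files seamlessly into HDFS, and offers users a unified file system namespace, the authors surveyed metadata management methods for distributed file systems and, on that basis, designed and implemented a unified metadata management method for the HDFS hierarchical storage system. The method redesigns the in-memory file metadata structure, builds a unified in-memory directory tree for the hierarchical storage system, and implements access management and reliability guarantees for that tree, achieving centralized, unified management of file metadata across the different tiers. Test results show that the method achieves unified management of file metadata on the heterogeneous resources of the hierarchical storage system and provides efficient metadata operations. The hierarchical storage system built on this method is highly reliable, and when reading and writing files of different sizes its throughput is somewhat better than that of EOSCTA, the traditional hierarchical storage system in high-energy physics.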

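Keywords (translated from the Chinese): HDFS distributed file system, hierarchical storage system, in-memory metadata management, unified namespace, persistence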

Abstract: As high-energy physics (HEP) experiments grow in scale and complexity, researchers face the challenge of storing massive volumes of data. Considering cost, energy consumption, storage period, and maintenance, tape libraries, with their large capacity and low cost, have become an indispensable choice for mass storage systems in HEP. However, HDFS heterogeneous storage does not support tape libraries, so it cannot meet the need for cost-effective storage when persisting and backing up massive experimental data on the Hadoop platforms used in HEP. To build an HDFS hierarchical storage system that supports disk-tape storage, integrates tape-layer files seamlessly into HDFS, and provides users with a unified file system namespace, this work first surveys existing metadata management methods for distributed file systems and then designs and implements an improved method that provides unified metadata management for the HDFS hierarchical storage system. The method redesigns the in-memory file metadata structure to build a unified in-memory directory tree, and implements access management and reliability assurance to achieve centralized, unified management of tape file metadata. Test results show that the metadata server provides unified management of file metadata on heterogeneous resources and efficient metadata operations. The hierarchical storage system built on this method is highly reliable, and its read and write throughput for files of different sizes is better than that of EOSCTA, the traditional hierarchical storage system in the HEP field.

Key words: Hadoop distributed file system (HDFS), hierarchical storage system, metadata management in memory, unified namespace, persistence
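
The idea of a unified in-memory directory tree that holds metadata for both disk-tier and tape-tier files can be illustrated with a small sketch. This is not the authors' implementation and does not use any HDFS API: the names `Tier`, `FileNode`, `DirNode`, `Namespace`, and the `location` field are hypothetical, chosen only to show how a single namespace might record which tier a file lives on while presenting one path hierarchy to the user.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Union


class Tier(Enum):
    """Storage tier of a file (hypothetical two-tier model)."""
    DISK = "disk"
    TAPE = "tape"


@dataclass
class FileNode:
    """In-memory metadata for one file; `location` is an opaque placeholder."""
    name: str
    size: int
    tier: Tier
    location: str  # e.g. a block ID on disk or a volume label on tape (assumed)


@dataclass
class DirNode:
    """Directory entry mapping child names to files or subdirectories."""
    name: str
    children: Dict[str, Union["DirNode", FileNode]] = field(default_factory=dict)


def _parts(path: str) -> List[str]:
    """Split an absolute path into its non-empty components."""
    return [p for p in path.split("/") if p]


class Namespace:
    """A single directory tree shared by files on every tier."""

    def __init__(self) -> None:
        self.root = DirNode(name="")

    def add_file(self, path: str, size: int, tier: Tier, location: str) -> FileNode:
        *dirs, leaf = _parts(path)
        node = self.root
        for d in dirs:
            # Create intermediate directories on demand.
            child = node.children.setdefault(d, DirNode(name=d))
            if not isinstance(child, DirNode):
                raise NotADirectoryError(d)
            node = child
        file_node = FileNode(leaf, size, tier, location)
        node.children[leaf] = file_node
        return file_node

    def lookup(self, path: str) -> Optional[Union[DirNode, FileNode]]:
        node: Union[DirNode, FileNode] = self.root
        for p in _parts(path):
            if not isinstance(node, DirNode) or p not in node.children:
                return None
            node = node.children[p]
        return node

    def migrate(self, path: str, tier: Tier, location: str) -> None:
        """Move a file between tiers by updating metadata only; the path is unchanged."""
        node = self.lookup(path)
        if not isinstance(node, FileNode):
            raise FileNotFoundError(path)
        node.tier, node.location = tier, location
```

In this sketch a file keeps the same path when it moves from disk to tape, because only its `tier` and `location` fields change; that is one simple way to picture a namespace that stays unified across tiers. Persistence and reliability, which the abstract also mentions, are outside the scope of the sketch.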