Computer Engineering and Applications, 2020, Vol. 56, Issue (21): 60-64. DOI: 10.3778/j.issn.1002-8331.1912-0278

• Big Data and Cloud Computing •

High Energy Physics Data Placement Strategy Based on Random Forest

CHENG Zhenjing, CHENG Yaodong, CHEN Gang, WANG Lu, LI Haibo, HU Qingbao

  1. Computing Center, Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  3. Tianfu Cosmic Ray Research Center, Institute of High Energy Physics, Chinese Academy of Sciences, Chengdu 610041, China
  • Online: 2020-11-01 Published: 2020-11-03

Abstract:

As high energy physics experiments such as the Large High Altitude Air Shower Observatory (LHAASO) continue to grow in scale, petabytes of physics data must be collected, stored and analyzed every year. Mass storage systems for high energy physics generally use a random data placement strategy, which does not consider the differences among data access scenarios, server nodes and storage devices. To address this, a data placement strategy based on the random forest algorithm is proposed for heterogeneous storage environments. Storage devices are divided into storage pools (a fast pool and a normal pool) according to their performance. The algorithm predicts and identifies the future read/write access scenario of a new file and, taking the current load of the target devices into account, finds the best placement for it. Evaluation with data samples collected from the production storage system of the LHAASO experiment verifies the effectiveness of the algorithm.

Key words: random forest, distributed storage system, heterogeneous storage, storage pool, data placement strategy, access scenario
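
The placement step summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the random forest classifier has already produced a scenario label for the file, and it covers only the pool and device choice. The labels `"hot"` and `"cold"`, the `Device` class, the `place` function, the label-to-pool mapping and the `max_load` threshold are all hypothetical names and values introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    pool: str    # "fast" or "normal", following the pool names in the abstract
    load: float  # current utilisation in [0, 1] (assumed load metric)


# Hypothetical mapping from a predicted access scenario to a preferred pool.
PREFERRED_POOL = {"hot": "fast", "cold": "normal"}


def place(scenario, devices, max_load=0.9):
    """Pick the least-loaded device in the preferred pool.

    Falls back to the other pool when every device in the preferred
    pool is at or above max_load (an assumed threshold).
    """
    preferred = PREFERRED_POOL[scenario]
    fallback = "normal" if preferred == "fast" else "fast"
    for pool in (preferred, fallback):
        candidates = [d for d in devices if d.pool == pool and d.load < max_load]
        if candidates:
            return min(candidates, key=lambda d: d.load)
    raise RuntimeError("no device below the load limit")


devices = [
    Device("ssd1", "fast", 0.95),
    Device("ssd2", "fast", 0.40),
    Device("hdd1", "normal", 0.20),
    Device("hdd2", "normal", 0.70),
]
print(place("hot", devices).name)
print(place("cold", devices).name)
```

Keeping the classifier outside `place` means the same load-aware selection could sit behind any predictor; in a fuller version the scenario label would come from a trained random forest model.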