Research on storage and query of large-scale multidimensional data

Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (13): 25-31.

SONG Aibo, WAN Yutong, GONG Huan, XUE Yingying

School of Computer Science and Engineering, Southeast University, Nanjing 211189, China

Online: 2016-07-01 Published: 2016-07-15

Abstract

Abstract: OLAP (Online Analytical Processing) systems built on data warehouses are currently the main tool for analyzing large-scale multidimensional data. With the development of information technology, the volume of multidimensional data has grown rapidly and its structure has become increasingly complex, so the performance of OLAP systems has degraded severely and can no longer meet data analysis needs. This paper proposes new methods for storing and querying large-scale multidimensional data on Hadoop, a distributed computing system. It designs HCFile (HDFS column file), a column-store file format on HDFS, and proposes a storage scheme for large-scale multidimensional data based on HCFile; the scheme improves the efficiency of aggregation computation and scales well. In addition, the paper exploits the hierarchical semantics of multidimensional data to design a dimension hierarchy index, and presents a method for aggregation computation that combines this index with MapReduce. Comparison experiments with Hive show that the proposed storage scheme and query method effectively improve the performance of large-scale multidimensional data analysis.

Key words: large-scale multidimensional data, Hadoop, data index, aggregation query
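
The abstract names three ideas: column-oriented storage, a dimension hierarchy index, and aggregation with MapReduce. The following is a minimal in-memory sketch of how these ideas can fit together. It is not the paper's HCFile format or its actual index structure; the table contents, the city-to-province hierarchy, and the function names (`build_hierarchy_index`, `map_phase`, `reduce_phase`) are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical column-oriented fact table: one list per column,
# so an aggregation can read only the columns it needs.
columns = {
    "city":  ["Nanjing", "Suzhou", "Hangzhou", "Nanjing", "Ningbo"],
    "sales": [10, 20, 30, 40, 50],
}

# Hypothetical dimension hierarchy: city (leaf level) -> province (parent level).
city_to_province = {
    "Nanjing": "Jiangsu", "Suzhou": "Jiangsu",
    "Hangzhou": "Zhejiang", "Ningbo": "Zhejiang",
}

def build_hierarchy_index(leaf_column, parent_of):
    """Map each parent-level member to the row ids of its leaf-level rows."""
    index = defaultdict(list)
    for row_id, leaf in enumerate(leaf_column):
        index[parent_of[leaf]].append(row_id)
    return dict(index)

def map_phase(index, measure_column, members):
    """Emit (member, value) pairs, touching only the rows the index points to."""
    for member in members:
        for row_id in index.get(member, []):
            yield member, measure_column[row_id]

def reduce_phase(pairs):
    """Sum the values for each key, as a reducer would."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

province_index = build_hierarchy_index(columns["city"], city_to_province)
result = reduce_phase(map_phase(province_index, columns["sales"], ["Jiangsu"]))
print(result)
```

In this toy setup, a query at the province level reads the `sales` column only at the row positions listed in the index, rather than scanning every row and every column. That is the general motivation for pairing column storage with a hierarchy-aware index.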

SONG Aibo, WAN Yutong, GONG Huan, XUE Yingying. Research on storage and query of large-scale multidimensional data[J]. Computer Engineering and Applications, 2016, 52(13): 25-31.
