Research on storage and query of large-scale multidimensional data

Abstract

Abstract: OLAP (Online Analytical Processing) systems built on data warehouses are the most widely used tools for analyzing large-scale multidimensional data. As information technology develops, data volume grows rapidly and data structures become increasingly complex, so OLAP performance has dropped severely and can no longer meet data analysis needs. This paper proposes new methods for storing and querying large-scale multidimensional data on Hadoop, a distributed computing system. It designs HCFile (HDFS Column File), a column-store file format on HDFS, and proposes a storage scheme based on it that improves the efficiency of aggregation and scales well. The paper also exploits the hierarchical semantics of multidimensional data to build a dimension hierarchy index, and gives a method that uses this index together with MapReduce to perform aggregation queries efficiently. Comparison experiments against Hive show that the proposed storage scheme and query method effectively improve the performance of large-scale multidimensional data analysis.

Key words: large-scale multidimensional data, Hadoop, data index, aggregation query
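
As a rough illustration of the idea summarized in the abstract, the Python sketch below shows how a dimension hierarchy index over column-oriented data might let an aggregation read only the rows and the measure column it needs. It is not the paper's HCFile format or its actual index layout: the column names, the city-to-province hierarchy, and the functions `build_hierarchy_index`, `map_phase`, and `reduce_phase` are all assumptions made for this example, and the map and reduce steps run in one process rather than on Hadoop.

```python
from collections import defaultdict

# Hypothetical column-oriented fact table: one list per column, aligned by row position.
columns = {
    "city":  ["Nanjing", "Suzhou", "Hangzhou", "Nanjing", "Ningbo"],
    "sales": [10, 20, 30, 40, 50],
}

# Hypothetical dimension hierarchy: each city (leaf level) rolls up to a province.
city_to_province = {
    "Nanjing": "Jiangsu",
    "Suzhou": "Jiangsu",
    "Hangzhou": "Zhejiang",
    "Ningbo": "Zhejiang",
}


def build_hierarchy_index(leaf_column, parent_of):
    """Map each upper-level member to the row positions of its leaf members."""
    index = defaultdict(list)
    for row, leaf in enumerate(leaf_column):
        index[parent_of[leaf]].append(row)
    return dict(index)


def map_phase(index, measure_column, members):
    """Emit (member, value) pairs, reading only the indexed rows of one measure column."""
    for member in members:
        for row in index.get(member, []):
            yield member, measure_column[row]


def reduce_phase(pairs):
    """Sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


index = build_hierarchy_index(columns["city"], city_to_province)
totals = reduce_phase(map_phase(index, columns["sales"], ["Jiangsu", "Zhejiang"]))
print(totals)
```

In this toy layout the aggregation never scans the `city` column at query time. It looks up row positions in the index and reads only `sales`, which is the general benefit a column store combined with a hierarchy index is meant to provide.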

SONG Aibo, WAN Yutong, GONG Huan, XUE Yingying. Research on storage and query of large-scale multidimensional data[J]. Computer Engineering and Applications, 2016, 52(13): 25-31.