Computer Engineering and Applications, 2018, Vol. 54, Issue (18): 74-81. DOI: 10.3778/j.issn.1002-8331.1801-0266


Research on distributed storage with hybrid column-row strategy using Trevni

WEN Weidong1, LI Yang2, LI Wenhai2   

  1. State Key Laboratory of Software Engineering of China, Wuhan 430072, China
  2. School of Computer, Wuhan University, Wuhan 430072, China
  • Online: 2018-09-15 Published: 2018-10-16


文卫东1,李  鸯2,李文海2   

  1. State Key Laboratory of Software Engineering, Wuhan 430072, China
  2. School of Computer, Wuhan University, Wuhan 430072, China

Abstract: A hybrid column-row storage strategy is proposed to speed up queries on both tree-structured and cascading relational datasets. Existing row-based and column-based stores are analyzed to motivate a schema-driven design for the hybrid storage. By integrating a special data type into state-of-the-art nested schemas, a column-row hybrid scheme is presented that reduces the record-assembly cost of general column-based storage, while common queries can be executed on column groups instead of on separate columns or rows. The scheme is implemented on the open-source platform Trevni/Avro. The implementation highlights the benefits of an optimized union-based physical design for nested schemas and handles NULL values in the general case where hierarchical entities need to be flattened. Experiments are conducted on one billion TPC-H records, with three flat tables organized hierarchically for three modified scan-intensive queries. On a general-purpose cluster, the proposed schema outperforms the competitors by at least 7× on every query involved.

Key words: nested schema, column storage, grouping strategy, TPCH, DBMS

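Abstract (translated from the Chinese): To improve query execution efficiency under tree-structured schemas and cascading relational schemas, a hybrid row-column storage method is proposed. Introducing the notion of grouping into column storage yields column-group physical units that are logically complete yet locally independent. The strengths and potential weaknesses of existing pure row stores and pure column stores are analyzed, and on that basis the physical design of the method is schema-driven, so that it applies to mainstream column-store architectures. The method is implemented on Trevni, the column-store core of the open-source framework Avro, with the aim of significantly reducing the cost of converting column data into tuples while ensuring that data exchange is limited to the columns a query needs. To improve usability under complex schemas, the storage structure is optimized with union so that accesses concentrate on the valid units, and null values are used to support schemas that do not satisfy foreign-key constraints in relational query scenarios. Experiments are run on one billion TPCH records, executing queries over a three-level nested grouping schema. The results show that the proposed method is significantly more efficient than traditional row and column storage methods.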

Key words (translated from the Chinese): nested schema, column storage, grouping strategy, TPCH, database
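
The column-group idea summarized in the abstract can be illustrated with a small in-memory sketch. This is not the Trevni/Avro API and not the paper's physical design: the function names `to_column_groups` and `scan`, the choice of groups, and the toy records are all assumptions made for illustration. Each group is kept as a list of short tuples, so a query that touches one group reads only that group and needs no cross-column stitching, and a missing attribute is stored as `None` to stand in for NULL.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]
Group = Tuple[str, ...]           # names of the columns stored together
Layout = Dict[Group, List[tuple]]  # one list of partial tuples per group


def to_column_groups(rows: Sequence[Row], groups: Sequence[Group]) -> Layout:
    """Split rows into column groups (hypothetical helper, not a Trevni call).

    Each group holds one short tuple per record; absent fields become None.
    """
    return {
        group: [tuple(row.get(col) for col in group) for row in rows]
        for group in groups
    }


def scan(layout: Layout, wanted: Sequence[str]) -> Tuple[List[tuple], int]:
    """Return the wanted columns per record and the number of groups read."""
    locations: List[Tuple[Group, int]] = []
    for col in wanted:
        for group in layout:
            if col in group:
                # Remember which group holds the column and at which offset.
                locations.append((group, group.index(col)))
                break
        else:
            raise KeyError(f"column {col!r} is not stored in any group")

    touched = {group for group, _ in locations}
    n_records = len(next(iter(layout.values()))) if layout else 0
    records = [
        tuple(layout[group][i][offset] for group, offset in locations)
        for i in range(n_records)
    ]
    return records, len(touched)


# Toy records with TPC-H style column names; the values are made up.
rows = [
    {"o_orderkey": 1, "o_totalprice": 10.0, "o_comment": "a"},
    {"o_orderkey": 2, "o_comment": "b"},  # o_totalprice missing, stored as None
]
groups = [("o_orderkey", "o_totalprice"), ("o_comment",)]
layout = to_column_groups(rows, groups)
records, groups_read = scan(layout, ["o_orderkey", "o_totalprice"])
```

In this sketch a query over `o_orderkey` and `o_totalprice` reads a single group, whereas a purely columnar layout would read two columns and re-join them by position. Which columns to group is a schema-level decision that the sketch simply takes as input.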