Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (11): 84-94. DOI: 10.3778/j.issn.1002-8331.2004-0309

• Big Data and Cloud Computing •

Research on Vector Grouping Aggregation Technology

ZHANG Yu, ZHANG Yansong

  1. National Satellite Meteorological Centre, Beijing 100081, China
  2. School of Information, Renmin University of China, Beijing 100872, China
  • Online: 2021-06-01 Published: 2021-05-31

Abstract:

Grouping and aggregation is one of the important OLAP operators, and it is a data-intensive workload. In main-memory database and GPU database scenarios, it is necessary to study not only how to optimize its performance but also where to execute it, so as to minimize the data transfer cost between CPU and GPU. Targeting the hardware characteristics of heterogeneous computing platforms, this paper proposes a vector aggregation technique. The grouping and aggregation operation, which sits at the end of the traditional pipeline, is decomposed and pushed down following an “early grouping, late aggregation” strategy. This separates the data-intensive grouping and aggregation computation from the pipeline and matches each operation to the computing characteristics of a processor, achieving optimal workload distribution on the heterogeneous computing platform. Converting traditional hash-based grouping aggregation into vector grouping aggregation significantly improves its performance. Experimental results show that vector grouping aggregation achieves a maximum performance improvement of 5 to 8 times over the representative high-performance main-memory database Hyper and the GPU database MapD. Vector aggregation not only improves OLAP aggregation performance but also separates the data-intensive workload from the query plan, so that a heterogeneous computing platform can configure its computing resources according to processor hardware characteristics and improve overall OLAP performance.

Key words: CPU-GPU heterogeneous computing platform, vector grouping aggregation, group vector index, data-intensive workload
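
The abstract does not spell out the algorithm, so the following is only a minimal sketch of how an “early grouping, late aggregation” split might look, under stated assumptions. All function names (`hash_group_sum`, `build_group_vector_index`, `vector_group_sum`) and the sample data are hypothetical. The sketch assumes that the group vector index is a per-row vector of dense group IDs, and that the late aggregation step accumulates into an array addressed by those IDs instead of probing a hash table per row. It is not the authors' implementation.

```python
from collections import defaultdict


def hash_group_sum(keys, values):
    """Baseline: hash-based grouping with one hash-table probe per row."""
    sums = defaultdict(int)
    for k, v in zip(keys, values):
        sums[k] += v
    return dict(sums)


def build_group_vector_index(keys):
    """Early grouping (assumed form): map every row to a dense group ID.

    Returns (group_ids, group_keys), where group_ids[i] is the ID of row i
    and group_keys[g] is the grouping key that ID g stands for.
    """
    id_of = {}
    group_ids = []
    group_keys = []
    for k in keys:
        g = id_of.get(k)
        if g is None:
            g = len(group_keys)  # next unused dense ID
            id_of[k] = g
            group_keys.append(k)
        group_ids.append(g)
    return group_ids, group_keys


def vector_group_sum(group_ids, num_groups, values):
    """Late aggregation: accumulate into a dense array addressed by group ID."""
    acc = [0] * num_groups
    for g, v in zip(group_ids, values):
        acc[g] += v
    return acc


# Hypothetical sample data: a grouping column and a measure column.
keys = ["asia", "europe", "asia", "america", "europe", "asia"]
revenue = [10, 20, 30, 40, 50, 60]

group_ids, group_keys = build_group_vector_index(keys)
totals = vector_group_sum(group_ids, len(group_keys), revenue)
result = dict(zip(group_keys, totals))

# Both strategies must produce the same per-group sums.
assert result == hash_group_sum(keys, revenue)
```

In this sketch the aggregation loop touches only two integer vectors and a small accumulator array, which is the kind of regular, data-intensive access pattern that can be scheduled separately from the rest of a query plan. Whether the paper builds its index this way is an assumption.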