Computer Engineering and Applications ›› 2019, Vol. 55 ›› Issue (21): 87-92. DOI: 10.3778/j.issn.1002-8331.1807-0289

• Big Data and Cloud Computing •

HBase Secondary Index Method Based on Coprocessor

GUO Hong, ZHOU Jianqian, ZHANG Yingying, GUO Kun

  1. College of Mathematics and Computer Sciences, Fuzhou University, Fuzhou 350116, China
  2. Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou 350116, China
  3. Key Laboratory of Spatial Data Mining & Information Sharing, Ministry of Education, Fuzhou 350116, China
  4. Power Science and Technology Corporation State Grid Information & Telecommunication Group, Fuzhou 350003, China
  • Online: 2019-11-01 Published: 2019-10-30

Abstract: In the era of big data, unstructured data is growing much faster than structured data, and HBase is widely used to store it at scale. HBase's built-in index is designed around row keys, so lookups by row key are highly efficient. Conditional queries on other fields, however, require a full table scan; performance degrades sharply, and HBase cannot be applied directly to real-time scenarios. To solve this problem, a coprocessor-based HBase secondary indexing method is proposed. The method uses HBase coprocessors to build indices that map frequently queried fields to row keys. At query time, the index data is scanned in parallel to obtain the row keys, which are then used to locate the records quickly. In addition, regions are pre-partitioned when a table is created, and a hash value is added to each row key when data is inserted. This improves insertion speed and avoids hotspotting, and it also guarantees that index data and main data reside in the same region, which saves one RPC request per query. Experiments on simulated data sets show that the proposed method achieves good query performance: it is faster than both HBase's built-in filter queries and a secondary index based on ElasticSearch. Its space overhead is also lower than that of the ElasticSearch-based secondary index.

Key words: HBase, secondary index, coprocessor, ElasticSearch
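
The co-location idea in the abstract can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the authors' implementation and not the HBase API: the names `MiniTable`, `salt`, `data_key`, `index_key`, and `region_of`, the key layout (`bucket|d|...` for data rows, `bucket|i|...` for index rows), the use of MD5, and the choice of four regions are all hypothetical. It shows how a hash prefix derived from the original row id spreads writes over pre-split regions while keeping each index entry in the same region as the row it points to, so a field query can scan the index and fetch the record locally.

```python
import hashlib
from bisect import bisect_right

NUM_REGIONS = 4  # assumed number of pre-split regions
# Split points "01", "02", "03" give one region per two-digit bucket prefix.
SPLIT_POINTS = [f"{i:02d}" for i in range(1, NUM_REGIONS)]


def salt(row_id):
    """Two-digit bucket prefix derived from a hash of the original row id."""
    digest = hashlib.md5(row_id.encode("utf-8")).digest()
    return f"{digest[0] % NUM_REGIONS:02d}"


def data_key(row_id):
    """Salted key of a main-data row (hypothetical layout)."""
    return f"{salt(row_id)}|d|{row_id}"


def index_key(row_id, field, value):
    """Salted key of an index row; it reuses the bucket of the data row."""
    return f"{salt(row_id)}|i|{field}={value}|{row_id}"


def region_of(key):
    """Region number that owns a key, given the pre-split points."""
    return bisect_right(SPLIT_POINTS, key)


class MiniTable:
    """In-memory stand-in for a pre-split table; each region is a dict."""

    def __init__(self):
        self.regions = [dict() for _ in range(NUM_REGIONS)]

    def put(self, row_id, record, indexed_fields):
        region = self.regions[region_of(data_key(row_id))]
        region[data_key(row_id)] = record
        # Stands in for a server-side hook that writes index rows on insert.
        for field in indexed_fields:
            region[index_key(row_id, field, record[field])] = row_id

    def query(self, field, value):
        results = []
        # A real system would scan the regions in parallel; here it is a loop.
        for bucket, region in enumerate(self.regions):
            prefix = f"{bucket:02d}|i|{field}={value}|"
            for key in sorted(region):
                if key.startswith(prefix):
                    row_id = region[key]
                    # The data row lives in the same region as its index row.
                    results.append(region[data_key(row_id)])
        return results
```

For example, after `t = MiniTable()` and a few calls such as `t.put("u1", {"name": "a", "city": "Fuzhou"}, ["city"])`, the call `t.query("city", "Fuzhou")` returns the matching records without touching any region other than the one that holds each index entry.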