HBase Secondary Index Method Based on Coprocessor

doi:10.3778/j.issn.1002-8331.1807-0289

Computer Engineering and Applications, 2019, Vol. 55, Issue 21: 87-92. DOI: 10.3778/j.issn.1002-8331.1807-0289

GUO Hong, ZHOU Jianqian, ZHANG Yingying, GUO Kun

1. College of Mathematics and Computer Sciences, Fuzhou University, Fuzhou 350116, China
2. Fujian Provincial Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou 350116, China
3. Key Laboratory of Spatial Data Mining & Information Sharing, Ministry of Education, Fuzhou 350116, China
4. Power Science and Technology Corporation State Grid Information & Telecommunication Group, Fuzhou 350003, China

Online: 2019-11-01 Published: 2019-10-30

Abstract: In the era of big data, unstructured data is growing much faster than structured data, and HBase is widely used to store it at scale. HBase's built-in index is designed around the row key (rowkey), so queries by row key are very efficient. Conditional queries on other fields, however, require a full table scan; performance degrades sharply, and HBase cannot be applied directly to real-time scenarios. To address this problem, a coprocessor-based HBase secondary indexing method is proposed. The method uses a coprocessor to build, within HBase, indices that map frequently queried fields to row keys. At query time, the index data are scanned in parallel to obtain the row keys, which are then used to locate the records quickly. In addition, regions are pre-partitioned when a table is created, and a hash value is added to the row key when data are inserted. This not only improves insertion speed but also avoids hotspots, and it guarantees that the index data and the main data reside in the same region, which saves one RPC request per query. Experiments on simulated data sets show that the proposed method achieves good query performance: it is faster than both HBase's built-in filter query and a secondary index based on ElasticSearch, and its space overhead is lower than that of the ElasticSearch-based secondary index.

Key words: HBase, secondary index, coprocessor, ElasticSearch
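The abstract describes the key layout only in outline, so the following is a minimal in-memory sketch of the idea and not the authors' implementation. It does not use the HBase API. The two-digit hash prefix, the `|d|` and `|i|` markers, the class name `Table`, and the choice of four regions are all assumptions made for illustration. The sketch shows how a hash prefix shared by a data row and its index row keeps both in the same pre-split region, so that an index hit can be resolved to the record without leaving that region.

```python
import hashlib
from bisect import bisect_right

NUM_REGIONS = 4  # assumed number of pre-split regions (illustrative)


def salt(record_id: str, buckets: int = NUM_REGIONS) -> str:
    """Two-digit bucket prefix derived from a stable hash of the record id."""
    digest = hashlib.md5(record_id.encode("utf-8")).digest()
    return f"{digest[0] % buckets:02d}"


class Table:
    """Toy stand-in for a pre-split table; each region is a plain dict."""

    def __init__(self, indexed_field: str, buckets: int = NUM_REGIONS):
        self.indexed_field = indexed_field
        self.buckets = buckets
        # Split points "01", "02", ... play the role of region start keys.
        self.splits = [f"{i:02d}" for i in range(1, buckets)]
        self.regions = [dict() for _ in range(buckets)]

    def _region(self, row_key: str) -> dict:
        # A row belongs to the last region whose start key is <= row_key.
        return self.regions[bisect_right(self.splits, row_key)]

    def put(self, record_id: str, record: dict) -> None:
        prefix = salt(record_id, self.buckets)
        main_key = f"{prefix}|d|{record_id}"
        self._region(main_key)[main_key] = record
        # This is the step a coprocessor hook would perform on insert:
        # write an index row that reuses the data row's hash prefix.
        value = record[self.indexed_field]
        index_key = f"{prefix}|i|{value}|{record_id}"
        self._region(index_key)[index_key] = main_key

    def query(self, value: str) -> list:
        results = []
        # Real region scans would run in parallel; here they run in turn.
        for bucket, region in enumerate(self.regions):
            start = f"{bucket:02d}|i|{value}|"
            for key in sorted(region):
                if key.startswith(start):
                    # Index row and data row share the region: local lookup.
                    results.append(region[region[key]])
        return results


table = Table(indexed_field="city")
table.put("u1", {"name": "a", "city": "Fuzhou"})
table.put("u2", {"name": "b", "city": "Xiamen"})
table.put("u3", {"name": "c", "city": "Fuzhou"})
print(sorted(r["name"] for r in table.query("Fuzhou")))  # ['a', 'c']
```

Because the index row carries the prefix of its data row, a query by field value has to probe every region, which matches the parallel index scan described in the abstract. The sketch ignores delimiter escaping, updates, and deletes.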

GUO Hong, ZHOU Jianqian, ZHANG Yingying, GUO Kun. HBase Secondary Index Method Based on Coprocessor[J]. Computer Engineering and Applications, 2019, 55(21): 87-92.
