计算机工程与应用 ›› 2019, Vol. 55 ›› Issue (15): 82-88.DOI: 10.3778/j.issn.1002-8331.1803-0386

• 大数据与云计算 • 上一篇    下一篇

编码单位可变的倒排索引压缩算法研究

安兆翔,瞿有利   

  1. 北京交通大学 计算机与信息技术学院,北京 100044
  • 出版日期:2019-08-01 发布日期:2019-07-26

Research on Inverted Index Compression Algorithm with Variable Coding Units

AN Zhaoxiang, QU Youli   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  • Online:2019-08-01 Published:2019-07-26

摘要: 倒排索引是大多数大型文本搜索系统的核心数据结构,索引压缩可以有效地减少倒排索引的空间占用,提升检索效率。针对倒排索引压缩算法中的字节对齐编码进行研究,对于其压缩率不够优秀的问题,提出了分区可变单位编码(PVU编码)。算法以可变单位方式代替固定字节存储,使实际存储空间更加贴合原码长度,从而提高压缩效果。针对序列均匀分区并非最优分区的问题,提出将最优分区问题转化为图论中最短路径问题的方法,使用Dijkstra算法求解序列的最优编码分区。通过对比实验验证了改进优化的PVU编码相较于传统的字节对齐编码能够更好地压缩倒排索引序列。

关键词: 倒排索引, 索引压缩, 可变单位, 分区优化

Abstract: The inverted index is the data structure at the core of most large-scale search systems. Index compression can effectively reduce the space occupied by inverted index and improve retrieval efficiency. To solve the problem of poor compression of byte-aligned coding, Partitioned Variable Unit(PVU) coding is proposed. Instead of storing multiple bytes, PVU stores integers in multiple units. Units include multiple values. This method makes the storage space more suitable for the original code length and has a better compression effect. To solve the optimal partition problem, the partition problem is transformed into the shortest path problem in graph theory. Dijkstra’s algorithm is designed to solve this problem. Experiments show that the optimized PVU can compress the inverted index sequence better than the traditional byte-aligned coding.

Key words: inverted index, index compression, variable unit, partition optimization