Computer Engineering and Applications, 2014, Vol. 50, Issue (23): 198-202.

Duplicate data delete technology based on double bloom filter

XI Yewen, YANG Jinmin   

  1. School of Information Science and Technology, Hunan University, Changsha 410082, China
  • Online: 2014-12-01 Published: 2014-12-12

基于双布鲁姆过滤器的数据排重技术

席晔文,杨金民   

  1. 湖南大学 信息科学与工程学院,长沙 410082

Abstract: File-level deduplication with a single Bloom filter can remove duplicate data only at whole-file granularity, while block-level deduplication with a single Bloom filter is too time-consuming. To address both drawbacks, this paper uses two Bloom filters to build a two-level deduplication structure, with one level operating on files and the other on data blocks. Experimental results show that the double Bloom filter algorithm deduplicates data at block granularity, keeps the false positive rate low, and takes 43%~68% less time than the block-level single Bloom filter algorithm.
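
The two-level structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not give the design details, so the following are all assumptions made for illustration: fixed-size blocks, MD5 digests as fingerprints, bit positions derived by double hashing, the filter sizes, and the names `BloomFilter`, `TwoLevelDeduplicator` and `add_file`. Only the general idea comes from the abstract: a file-level filter screens whole files first, and the block-level filter is consulted only for files not seen before.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over byte strings (illustrative, not tuned)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: derive all bit positions from one MD5 digest
        # using double hashing (h1 + i * h2) mod m.
        digest = hashlib.md5(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably present"; false positives are possible.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


class TwoLevelDeduplicator:
    """Hypothetical two-level scheme: file-level filter, then block-level."""

    def __init__(self, block_size=4096, num_bits=1 << 20, num_hashes=7):
        self.block_size = block_size
        self.file_filter = BloomFilter(num_bits, num_hashes)
        self.block_filter = BloomFilter(num_bits, num_hashes)

    def add_file(self, data):
        """Return the blocks judged new; an empty list means nothing to store."""
        file_fp = hashlib.md5(data).digest()
        if file_fp in self.file_filter:
            # Probable whole-file duplicate: skip all block-level work.
            return []
        self.file_filter.add(file_fp)

        new_blocks = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            block_fp = hashlib.md5(block).digest()
            if block_fp not in self.block_filter:
                self.block_filter.add(block_fp)
                new_blocks.append(block)
        return new_blocks
```

In this sketch the time saving comes from the early return: a file that matches the file-level filter never reaches the per-block loop. Because a Bloom filter can report false positives, a "probably present" answer may occasionally discard data that is in fact new, which is why the false positive rate matters for this kind of scheme.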

Key words: data deduplication, set membership query, Bloom filter, MD5, false positive rate
