Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (9): 72-79.DOI: 10.3778/j.issn.1002-8331.1512-0309

Previous Articles     Next Articles

Verification method for string similarity joins based on bi-directional filtering

HUANG Ying, SONG Chunhua, NIU Baoning   

  1. School of Computer Science and Technology, Taiyuan University of Technology, Taiyuan 030024, China
  • Online:2017-05-01 Published:2017-05-15

双向过滤的字符串相似连接验证方法

黄  樱,宋春花,牛保宁   

  1. 太原理工大学 计算机学院,太原 030024

Abstract: A string similarity join finds similar string pairs from two sets of strings. It plays an important role in many real-world applications. Various algorithms have been proposed to address its efficiency issues. Partition-based filter-verification methods, such as Pass-Join, are promising, which quickly screens out possible similar string pairs(candidate set)by searching partitioned parts of a string in another string, in order of increasing length, and then performs similarity verification based on edit-distance. Motivated by the fact that the effect produced by filtering in the descending order of string length is better than in the ascending order, a novel bi-directional filtering-verification mechanism is proposed. At the filtering stage, it pipelines the results from length descending filtering to length ascending filtering to further reduce the size of the candidate set. At the verification stage, it makes use of the two pairs of matched substrings from the bi-directional filtering to partition the target string pairs into several short substring pairs to accelerate the verification process. Experimental results show that the proposed bi-directional filtering-verification algorithm outperforms the origin algorithm on real-world datasets.

Key words: string similarity joins, bi-directional filtering-verification mechanism, filter-verification framework

摘要: 字符串相似连接是指在字符串集合中找出相似的字符串对,是许多应用的关键操作,寻找高效的字符串相似连接算法已成为研究热点。基于划分的过滤-验证方法(Pass-Join)与其他方法相比具有较高的效率。它按照字符串长度递增的顺序访问字符串集合,通过查找一个字符串的划分块是否存在于另一个字符串中,快速筛选出可能相似的字符串对(候选集),然后利用编辑距离进行相似性验证。研究发现,按照字符串长度递减的顺序进行过滤(长度递减过滤)的效果优于按照长度递增的顺序过滤(长度递增过滤)的效果,基于此,提出双向过滤-验证机制:在过滤阶段对长度递减过滤的结果再进行一次长度递增过滤,进一步减小候选集大小;在验证阶段利用双向过滤产生的两对划分块和其匹配子串分隔字符串对,从而减小需要验证的字符串的长度,加速验证过程。实验证明,双向过滤-验证算法在真实数据集上优于原算法。

关键词: 字符串相似连接, 双向过滤-验证机制, 过滤-验证框架