Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (19): 123-127.
GUO Wenlong
郭文龙
Abstract: Heterogeneous database integration produces approximately duplicate records, but their number is limited. Detecting them with the traditional SNM algorithm requires comparing all records within the window, which is inefficient. To address this defect, an improved SNM algorithm based on length filtering and effective weights is proposed. Within the window, record pairs that cannot be approximate duplicates are first excluded according to the ratio of their lengths, which reduces the number of record comparisons and improves detection efficiency. Effective weights are then calculated by setting a validity factor and a weight proportion for each record attribute, and records are detected using these weights, which improves both the recall ratio and the precision ratio. Experimental results show that the improved algorithm outperforms the SNM algorithm on all performance measures evaluated.
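The two ideas in the abstract, length filtering inside the sliding window and similarity scoring with effective weights, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract gives no formulas, so everything below is an assumption made for illustration. In particular, the validity factor is assumed to be 1 when an attribute is non-empty in both records and 0 otherwise, the effective weights are assumed to be the base weights renormalised over valid attributes, field similarity is taken from the standard library's `difflib.SequenceMatcher`, and the names `length_ratio`, `effective_weights`, `detect_duplicates`, `min_len_ratio`, and `threshold` are hypothetical.

```python
from difflib import SequenceMatcher


def length_ratio(a, b):
    """Ratio of the shorter string length to the longer one, in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else min(len(a), len(b)) / longest


def effective_weights(rec_a, rec_b, weights):
    """Assumed rule: an attribute is valid only if it is non-empty in both
    records; the base weights of valid attributes are rescaled to sum to 1."""
    valid = {k: w for k, w in weights.items() if rec_a.get(k) and rec_b.get(k)}
    total = sum(valid.values())
    return {k: w / total for k, w in valid.items()} if total else {}


def similarity(rec_a, rec_b, weights):
    """Weighted sum of per-attribute string similarities (assumed measure)."""
    eff = effective_weights(rec_a, rec_b, weights)
    return sum(
        w * SequenceMatcher(None, rec_a[k], rec_b[k]).ratio()
        for k, w in eff.items()
    )


def detect_duplicates(records, key, weights, window=3,
                      min_len_ratio=0.6, threshold=0.85):
    """Sorted-neighborhood pass with a length filter before each comparison.

    Returns the matched index pairs and the number of full comparisons made.
    """
    # Sort record indices by the sorting key, as in the basic SNM.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs, comparisons = [], 0
    for pos, i in enumerate(order):
        # Compare each record only with the next (window - 1) records.
        for j in order[pos + 1:pos + window]:
            a, b = records[i], records[j]
            text_a = "".join(a.get(k) or "" for k in weights)
            text_b = "".join(b.get(k) or "" for k in weights)
            # Length filter: skip pairs whose lengths differ too much.
            if length_ratio(text_a, text_b) < min_len_ratio:
                continue
            comparisons += 1
            if similarity(a, b, weights) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs, comparisons
```

The length check costs only two length lookups per pair, so under these assumptions it is far cheaper than the field-by-field comparison it avoids. The window size and both thresholds are illustrative values, not figures taken from the paper.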
Key words: approximately duplicate records, data cleaning, effective weights, Sorted-Neighborhood Method (SNM)
摘要: 异构数据库集成中产生了相似重复记录,但数量是有限的,采用传统的SNM算法进行检测,需要在窗口内对所有记录进行比对,效率不高。针对这一缺陷,提出一种基于长度过滤和有效权值的SNM改进算法,在窗口内根据两条记录的长度比例首先将不可能构成相似重复记录的数据排除在外,减少了记录比较的次数,提高了检测效率;进一步通过设置属性有效性因子和权重比例计算有效权值,利用有效权值进行检测,提高了查全率和查准率。实验证明改进算法在各种性能上均优于SNM算法。
关键词: 相似重复记录, 数据清洗, 有效权值, SNM算法
GUO Wenlong. Improved SNM algorithm based on length filtering and effective weights[J]. Computer Engineering and Applications, 2014, 50(19): 123-127.
郭文龙. 基于长度过滤和有效权值的SNM改进算法[J]. 计算机工程与应用, 2014, 50(19): 123-127.
URL: http://cea.ceaj.org/EN/
http://cea.ceaj.org/EN/Y2014/V50/I19/123