Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (19): 123-127.

• Databases, Data Mining, Machine Learning •

Improved SNM algorithm based on length filtering and effective weights

GUO Wenlong (郭文龙)

  1. College of Electronics and Information Science, Fujian Jiangxia University, Fuzhou 350108, China
  • Publication date: 2014-10-01 Release date: 2014-09-29

Abstract: Integrating heterogeneous databases produces approximately duplicate records, but only a limited number of them. The traditional Sorted-Neighborhood Method (SNM) detects such records by comparing all records inside the window with one another, which is inefficient. To address this shortcoming, an improved SNM algorithm based on length filtering and effective weights is proposed. Within the window, the length ratio of two records is used first to exclude pairs that cannot be approximate duplicates, which reduces the number of record comparisons and improves detection efficiency. In addition, a validity factor and a weight proportion are set for each attribute to compute effective weights, and detection based on these effective weights improves both recall and precision. Experiments show that the improved algorithm outperforms the SNM algorithm on every performance measure evaluated.

Key words: approximately duplicate records, data cleaning, effective weights, Sorted-Neighborhood Method (SNM)
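
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract gives no formulas, so every definition here is an assumption made for illustration. Record length is taken as the total character count of the compared fields, the length filter keeps a pair only if shorter/longer is at least `min_ratio`, the validity factor is 1 when an attribute is non-empty in both records and 0 otherwise, effective weights are the base weights renormalised over the valid attributes, and field similarity uses `difflib.SequenceMatcher` as a stand-in. All function names, thresholds, and sample records are hypothetical.

```python
from difflib import SequenceMatcher


def record_length(rec, fields):
    """Total number of characters over the compared fields (assumed length measure)."""
    return sum(len(rec.get(f) or "") for f in fields)


def passes_length_filter(a, b, fields, min_ratio):
    """Assumed length filter: keep the pair only if shorter/longer >= min_ratio."""
    la, lb = record_length(a, fields), record_length(b, fields)
    if max(la, lb) == 0:
        return False
    return min(la, lb) / max(la, lb) >= min_ratio


def effective_weights(a, b, weights):
    """Renormalise the base weights over attributes that are valid in both records.

    Assumed validity factor: 1 if the attribute is non-empty in both records, else 0.
    """
    valid = {f: w for f, w in weights.items() if a.get(f) and b.get(f)}
    total = sum(valid.values())
    return {f: w / total for f, w in valid.items()} if total > 0 else {}


def record_similarity(a, b, weights):
    """Weighted sum of field similarities, using the effective weights."""
    eff = effective_weights(a, b, weights)
    return sum(w * SequenceMatcher(None, a[f], b[f]).ratio() for f, w in eff.items())


def snm_detect(records, weights, key, window=3, min_ratio=0.5, threshold=0.85):
    """Sorted-neighborhood pass with a length pre-filter inside the window.

    Returns the detected duplicate pairs (as index pairs) and the number of
    full similarity computations that were actually performed.
    """
    fields = list(weights)
    # Sort record indices by the sorting key, then slide a window over them.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs, comparisons = [], 0
    for pos, i in enumerate(order):
        # Compare with the (window - 1) records that precede it in sorted order.
        for j in order[max(0, pos - window + 1):pos]:
            if not passes_length_filter(records[i], records[j], fields, min_ratio):
                continue  # cheap rejection, no field-by-field comparison
            comparisons += 1
            if record_similarity(records[i], records[j], weights) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs), comparisons


# Hypothetical sample data, for illustration only.
records = [
    {"name": "Zhang Wei", "city": "Fuzhou", "phone": "13800000001"},
    {"name": "Zhang Wei", "city": "Fuzhou", "phone": ""},
    {"name": "Li", "city": "", "phone": ""},
    {"name": "Wang Fang", "city": "Xiamen", "phone": "13900000002"},
]
weights = {"name": 0.5, "city": 0.3, "phone": 0.2}

print(snm_detect(records, weights, key=lambda r: r["name"]))
```

In this toy input, records 0 and 1 differ only in a missing phone number. With the raw weights, the empty field would cap their score at 0.8, below the assumed 0.85 threshold. Renormalising over the two valid attributes lets the pair be detected, while the length filter skips the very short record without a field-by-field comparison.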