Computer Engineering and Applications ›› 2008, Vol. 44 ›› Issue (29): 171-173.DOI: 10.3778/j.issn.1002-8331.2008.29.048

• Database, Signal and Information Processing •

New approach for clustering similar duplicate records based on high dimensions

CAO Qu-jiang,DONG Ming   

  1. Department of Computer and Electrical Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
  • Received:2007-12-03 Revised:2008-03-03 Online:2008-10-11 Published:2008-10-11
  • Contact: CAO Qu-jiang

Chinese title (translated): A new method for detecting duplicate records by clustering in high-dimensional space

CAO Qu-jiang (曹渠江), DONG Ming (董明)

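  1. School of Computer and Electrical Engineering, University of Shanghai for Science and Technology (上海理工大学), Shanghai 200093
  • Corresponding author: CAO Qu-jiang (曹渠江)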

Abstract: Data cleaning is an important area of data warehousing, and detecting duplicate records is a critical task in data cleaning. This paper proposes a new duplicate detection method. Based on N-grams, the approach maps all records in a relation into a high-dimensional space and clusters duplicate records with an improved DBSCAN algorithm named IDS. IDS clusters approximately duplicate records using an adjustable density. Experimental results demonstrate the effectiveness of the approach.

Key words: approximately duplicate records, N-gram, Intrusion Detection System (IDS)

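Abstract (translated from Chinese): Data cleaning is an important research area in building data warehouses, and detecting approximately duplicate records is a very important task in data cleaning. A new clustering method for detecting approximately duplicate records is proposed. Based on N-grams, the method maps the records of a relational table into a high-dimensional space and clusters them with IDS, an improved DBSCAN algorithm with adjustable density, to detect approximately duplicate records. Experiments demonstrate the effectiveness of the method.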

Key words (translated from Chinese): approximately duplicate records, N-gram, intrusion detection system
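
The pipeline outlined in the abstract (N-gram mapping of records into a high-dimensional space, followed by density-based clustering) can be illustrated with a minimal sketch. The abstract does not specify the N-gram mapping, the distance measure, or how IDS adjusts density, so everything below is an assumption: character bigram counts, Euclidean distance, and textbook DBSCAN with a fixed radius stand in for the paper's method. The function names, sample records, and the parameters `eps` and `min_pts` are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_vector(record, n=2):
    """Map a record string to a sparse vector of character n-gram counts."""
    text = record.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def distance(u, v):
    """Euclidean distance between two sparse n-gram count vectors."""
    keys = set(u) | set(v)
    return sqrt(sum((u[k] - v[k]) ** 2 for k in keys))


def dbscan(vectors, eps, min_pts):
    """Textbook DBSCAN (not the paper's IDS); label -1 marks noise."""
    labels = [None] * len(vectors)

    def neighbors(i):
        # Brute-force region query; includes the point itself.
        return [j for j in range(len(vectors))
                if distance(vectors[i], vectors[j]) <= eps]

    cluster = -1
    for i in range(len(vectors)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reclaimed as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # core point: keep expanding
                queue.extend(nbrs)
    return labels


# Hypothetical records: the first two are near-duplicates.
records = [
    "John Smith, 12 Main St",
    "Jon Smith, 12 Main St.",
    "Alice Wong, 5 Oak Ave",
]
vectors = [ngram_vector(r) for r in records]
labels = dbscan(vectors, eps=3.0, min_pts=2)
for record, label in zip(records, labels):
    print(label, record)
```

Records that share a non-negative label are candidate duplicates. With a single fixed `eps`, the result is sensitive to record length and typo rate, which is presumably the motivation for the adjustable density that the abstract attributes to IDS.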