Computer Engineering and Applications, 2008, Vol. 44, Issue (29): 171-173. DOI: 10.3778/j.issn.1002-8331.2008.29.048

• Database, Signal and Information Processing •

New approach for clustering similar duplicate records based on high dimensions

CAO Qu-jiang, DONG Ming

  1. Department of Computer and Electrical Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
  • Received: 2007-12-03  Revised: 2008-03-03  Online: 2008-10-11  Published: 2008-10-11
  • Contact: CAO Qu-jiang

Abstract: Data cleaning is an important research area in building data warehouses, and detecting approximately duplicate records is a critical task within it. This paper proposes a new clustering-based method for detecting approximately duplicate records. Using N-grams, the method maps every record of a relational table to a point in a high-dimensional space and then clusters the points with IDS, an improved DBSCAN algorithm whose density is adjustable. Experimental results demonstrate the effectiveness of the approach.

Key words: approximately duplicate records, N-gram, IDS (improved DBSCAN algorithm)
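
The idea summarized in the abstract (map each record to a vector of N-gram counts, then group nearby vectors with a density-based clustering step) can be illustrated with a minimal sketch. The abstract does not describe how IDS adjusts density, so the code below substitutes plain DBSCAN with a fixed radius. The function names, the use of character bigrams, the cosine distance, the sample records, and the values of `eps` and `min_pts` are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
from math import sqrt


def ngram_vector(text, n=2):
    """Map a record string to a sparse vector of character n-gram counts."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def dbscan(vectors, eps, min_pts):
    """Plain DBSCAN (not IDS); returns one label per vector, -1 means noise."""
    n = len(vectors)
    labels = [None] * n

    def neighbors(i):
        # Brute-force range query; the point itself is always included.
        return [j for j in range(n)
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
                continue
            if labels[j] is not None:
                continue             # already assigned to a cluster
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


# Hypothetical sample records; the last one has no near-duplicate.
records = [
    "John Smith, 12 Main Street, Springfield",
    "Jon Smith, 12 Main St., Springfield",
    "Maria Garcia, 45 Oak Avenue, Riverton",
    "Maria Garcia, 45 Oak Ave, Riverton",
    "Wei Zhang, 9 Lake Road, Hilltop",
]
labels = dbscan([ngram_vector(r) for r in records], eps=0.4, min_pts=2)
print(labels)  # [0, 0, 1, 1, -1]
```

Records that share a label are candidate duplicates, and a label of -1 marks a record with no sufficiently close neighbor. A fixed `eps` treats dense and sparse regions alike, which is presumably the limitation an adjustable-density variant such as IDS is meant to address.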