Computer Engineering and Applications ›› 2008, Vol. 44 ›› Issue (29): 171-173. DOI: 10.3778/j.issn.1002-8331.2008.29.048
• Database, Signal and Information Processing •
CAO Qu-jiang, DONG Ming
曹渠江,董 明
Abstract: Data cleaning is an important research area in building data warehouses, and detecting approximately duplicate records is one of its critical tasks. This paper proposes a new clustering-based method for detecting such records. Using N-grams, the method maps every record in a relation to a point in a high-dimensional space and then clusters approximately duplicate records with IDS, an improved DBSCAN algorithm with adjustable density. Experimental results demonstrate the effectiveness of the method.
Key words: approximately duplicate records, N-gram, IDS (improved DBSCAN)
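The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' IDS algorithm, whose adjustable-density details are not given on this page: it substitutes textbook DBSCAN and assumes character bigrams, cosine distance, and illustrative values for `eps` and `min_pts`. The function names and sample records are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_vector(record: str, n: int = 2) -> Counter:
    """Map a record to a sparse vector of character n-gram counts."""
    text = " ".join(record.lower().split())  # normalise case and whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def dbscan(vectors, eps, min_pts):
    """Textbook DBSCAN (not IDS); returns one label per vector, -1 for noise."""
    labels = [None] * len(vectors)

    def neighbours(i):
        return [j for j in range(len(vectors))
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = -1
    for i in range(len(vectors)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise joins as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


# Hypothetical records; eps and min_pts are assumed values, not from the paper.
records = [
    "John Smith, 12 Main Street, Boston",
    "Jon Smith, 12 Main St., Boston",
    "Maria Garcia, 48 Oak Avenue, Denver",
    "Maria Garcia, 48 Oak Ave, Denver",
    "Wei Zhang, 7 Lake Road, Seattle",
]
labels = dbscan([ngram_vector(r) for r in records], eps=0.4, min_pts=2)
print(labels)  # [0, 0, 1, 1, -1]
```

Records that share a label are candidate duplicates, and a label of -1 marks a record with no close neighbour. Loosening or tightening `eps` changes how aggressively near-duplicates are merged, which is the kind of density control the abstract attributes to IDS.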
CAO Qu-jiang, DONG Ming. New approach for clustering similar duplicate records based on high dimensions[J]. Computer Engineering and Applications, 2008, 44(29): 171-173.
曹渠江,董 明. 一种在高维空间中聚类检测重复记录的新方法[J]. 计算机工程与应用, 2008, 44(29): 171-173.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2008.29.048
http://cea.ceaj.org/EN/Y2008/V44/I29/171