Computer Engineering and Applications ›› 2011, Vol. 47 ›› Issue (30): 127-131.

• Database, Signal and Information Processing •

Research on data cleaning based on clustering feedback learning

SHI Yanhua, LI Shuyu

  1. School of Computer Science, Shaanxi Normal University, Xi’an 710062, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2011-10-21 Published: 2011-10-21

Abstract: Cleaning Approximately Duplicate Records (CADR) is a core problem in data cleaning, but carrying it out effectively remains a research difficulty. This paper proposes a Clustering Feedback Pattern Specification (CFPS) to verify the validity of CADR. Based on an analysis of the association relations among the categories produced by clustering, the concepts of cluster pattern and feedback pattern are first introduced, together with methods for implementing them. The CFPS for data cleaning is then given. Finally, a project case from a credit data exchange system is used to verify the effectiveness of CFPS.

Key words: data cleaning, duplicate records, pattern specification, cluster learning, feedback learning