Computer Engineering and Applications ›› 2008, Vol. 44 ›› Issue (36): 1-6. DOI: 10.3778/j.issn.1002-8331.2008.36.001

• Doctoral Forum •

Approach of quantifying data quality dimensions

HAN Jing-yu (1,2), SONG Ai-bo (2), DONG Yi-sheng (2)

  1. Institute of Computing, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  2. Department of Computer Science and Engineering, Southeast University, Nanjing 210096, China
  • Received:2008-09-28 Revised:2008-11-18 Online:2008-12-21 Published:2008-12-21
  • Contact: HAN Jing-yu

A method for quantifying data quality dimensions (数据质量维度量化方法)

韩京宇 (1,2), 宋爱波 (2), 董逸生 (2)

  1. 南京邮电大学 计算机技术研究所, 南京 210003
  2. 东南大学 计算机科学与工程学院, 南京 210096
  • Corresponding author: 韩京宇 (HAN Jing-yu)

Abstract: To quantify data quality dimensions automatically in a multi-source environment, this paper proposes a novel approach called Quantify Dimensions within Context (QDC). Data quality can be gauged by the discrepancy between a data view and the perfect representation of the entity it describes. Because the perfect representation of an entity is difficult to obtain, the paper approximates it within the entity's available context and quantifies the quality dimensions within that context scope. Borrowing the concept of entropy from information theory yields a measure that applies uniformly to different types of data. In this way the two most important quality dimensions, accuracy and completeness, are quantified. QDC gives an objective score and ranking in a cooperative multi-source environment and avoids laborious human interaction. As an automatic quality rating solution, it is particularly suited to large-scale datasets. Theoretical analysis and experiments show that the approach performs well for quality rating.

Key words: data quality, information theory, entropy
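
The abstract does not give the paper's formulas, so the following is only a minimal sketch of the entropy concept it borrows from information theory, not the QDC measure itself. It computes the Shannon entropy of the values that several sources report for one attribute; the function names `shannon_entropy` and `normalized_entropy` and the sample values are assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum -p * log2(p) over the distinct values that were observed.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def normalized_entropy(values):
    """Entropy scaled to [0, 1] by the maximum possible for the distinct values seen.

    0 means every source reports the same value; 1 means the reported
    values are spread evenly over the distinct alternatives.
    """
    distinct = len(set(values))
    if distinct <= 1:
        return 0.0
    return shannon_entropy(values) / math.log2(distinct)


# Hypothetical values reported by four sources for one attribute.
agreeing = ["Nanjing", "Nanjing", "Nanjing", "Nanjing"]
split = ["Nanjing", "Nanjing", "Nanking", "Nanking"]

print(shannon_entropy(agreeing))  # no disagreement among sources
print(shannon_entropy(split))     # two equally likely alternatives
```

Because entropy depends only on how often each value occurs, the same function works for strings, numbers, or categorical codes, which is one plausible reading of why an entropy-based measure suits different data types.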

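Abstract (Chinese version): To automate data quality assessment, this paper proposes QDC (Quantify Dimensions within Context), a method for quantifying data quality within a context scope. Data quality can be measured by the gap between the data and the "perfect representation" of the corresponding entity. Because the perfect representation is difficult or costly to obtain, the paper proposes that, when multiple data sources are available, it can be replaced by the "closest approximation" obtained by voting within the data's context scope, which fixes the reference standard for quality assessment. The paper also proposes using the information entropy measure from information theory to unify the quality dimensions of different data types into a common metric. As an automated data quality assessment method, QDC not only gives accurate assessment values for the accuracy and completeness dimensions of data but is also computationally efficient.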

Keywords (Chinese version): data quality, information theory, information entropy
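
The Chinese abstract describes replacing the perfect representation with the closest approximation obtained by voting across sources. The sketch below illustrates that idea under strong simplifying assumptions: each source's view is a flat dictionary, voting is a plain per-attribute majority, and accuracy and completeness are exact-match ratios rather than the entropy-based measures of the paper. The names `vote_reference`, `completeness`, and `accuracy`, and the sample records, are hypothetical.

```python
from collections import Counter


def vote_reference(views):
    """Approximate the perfect representation by per-attribute majority vote.

    `views` is a list of dicts (attribute -> value, or None when missing),
    one dict per source describing the same entity.
    """
    attributes = sorted({attr for view in views for attr in view})
    reference = {}
    for attr in attributes:
        reported = [view[attr] for view in views if view.get(attr) is not None]
        if reported:
            reference[attr] = Counter(reported).most_common(1)[0][0]
    return reference


def completeness(view, reference):
    """Fraction of reference attributes for which the view supplies a value."""
    if not reference:
        return 1.0
    filled = sum(1 for attr in reference if view.get(attr) is not None)
    return filled / len(reference)


def accuracy(view, reference):
    """Fraction of the view's supplied values that match the voted reference."""
    filled = [attr for attr in reference if view.get(attr) is not None]
    if not filled:
        return 0.0
    return sum(1 for attr in filled if view[attr] == reference[attr]) / len(filled)


# Three hypothetical sources describing the same entity.
views = [
    {"name": "Ann Lee", "city": "Nanjing", "zip": "210003"},
    {"name": "Ann Lee", "city": "Nanjing", "zip": None},
    {"name": "A. Lee", "city": "Nanjing", "zip": "210003"},
]

reference = vote_reference(views)
for i, view in enumerate(views):
    print(i, completeness(view, reference), accuracy(view, reference))
```

Scoring missing values under completeness and wrong values under accuracy keeps the two dimensions separate, so a source that is sparse but correct is not penalized twice.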