Computer Engineering and Applications, 2010, Vol. 46, Issue (28): 146-148. DOI: 10.3778/j.issn.1002-8331.2010.28.041

• Database, Signal and Information Processing •

Mass data cleaning algorithm based on extended tree-like knowledge base

YAN Cai-rong (燕彩蓉)1, SUN Gui-ning (孙圭宁)2, GAO Nian-gao (高念高)2

  1. School of Computer, Donghua University, Shanghai 201620, China
  2. Triman Information & Technology Ltd., Shanghai 200040, China
  • Received: 2009-03-02 Revised: 2009-04-22 Online: 2010-10-01 Published: 2010-10-01
  • Contact: YAN Cai-rong

Abstract: To address the limitations of traditional knowledge base representations, an extended tree-like knowledge base is built by decomposing and recomposing domain knowledge. Each leaf node corresponds to a concrete knowledge instance, called atomic knowledge, while each non-leaf node corresponds only to a knowledge concept. A data cleaning algorithm based on this knowledge base is also proposed. Given the nodes the user selects, it automatically extracts their atomic knowledge, analyzes how the extracted items relate, removes duplicates, orders the remainder into an atomic knowledge sequence by processing weight, and then cleans the data by applying the sequence item by item. Experiments show that optimizing user requests in this way reduces the number of scans over the mass data and noticeably improves cleaning efficiency.

Key words: domain knowledge, knowledge base, data cleaning, mass data
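
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the function names, the use of leaf names to detect duplicates, the convention that a smaller weight runs earlier, and the sample rules are all assumptions made for illustration. It shows one plausible reading in which selected nodes are expanded to their leaves, duplicates are dropped, the leaves are ordered by weight, and the whole sequence is applied in a single pass over the records.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, str]


@dataclass
class Node:
    """A node in the knowledge tree (hypothetical structure)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    # Only leaves (atomic knowledge) carry a rule and a weight;
    # concept nodes leave these at their defaults.
    rule: Optional[Callable[[Record], Record]] = None
    weight: int = 0

    def is_leaf(self) -> bool:
        return not self.children


def collect_atomic(node: Node) -> List[Node]:
    """Return every leaf under a node, depth-first."""
    if node.is_leaf():
        return [node]
    leaves: List[Node] = []
    for child in node.children:
        leaves.extend(collect_atomic(child))
    return leaves


def build_sequence(selected: Iterable[Node]) -> List[Node]:
    """Extract atomic knowledge, drop duplicates, order by weight."""
    unique: Dict[str, Node] = {}
    for node in selected:
        for leaf in collect_atomic(node):
            # Assumption: a leaf is identified by its name.
            unique.setdefault(leaf.name, leaf)
    # Assumption: a smaller weight means the rule is applied earlier.
    return sorted(unique.values(), key=lambda leaf: leaf.weight)


def clean(records: Iterable[Record], sequence: List[Node]) -> List[Record]:
    """Apply the whole sequence to each record in one pass over the data."""
    cleaned: List[Record] = []
    for record in records:
        for leaf in sequence:
            if leaf.rule is not None:
                record = leaf.rule(record)
        cleaned.append(record)
    return cleaned


# Two illustrative atomic rules grouped under one concept node.
trim = Node("trim_name", rule=lambda r: {**r, "name": r["name"].strip()}, weight=1)
upper = Node("upper_city", rule=lambda r: {**r, "city": r["city"].upper()}, weight=2)
text_rules = Node("text", children=[trim, upper])

# The user selects a concept node and one of its own leaves,
# so "upper_city" is requested twice and must be deduplicated.
sequence = build_sequence([upper, text_rules])
result = clean([{"name": " Li ", "city": "shanghai"}], sequence)
```

Under these assumptions, applying each selected node separately would scan the data once per selection, whereas the merged sequence needs only one scan. This is the kind of saving the abstract attributes to optimizing user requests.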
