Mass data cleaning algorithm based on extended tree-like knowledge base

doi:10.3778/j.issn.1002-8331.2010.28.041

Computer Engineering and Applications, 2010, Vol. 46, Issue (28): 146-148. DOI: 10.3778/j.issn.1002-8331.2010.28.041

• Database, Signal and Information Processing •

YAN Cai-rong¹, SUN Gui-ning², GAO Nian-gao²

1. School of Computer, Donghua University, Shanghai 201620, China
2. Triman Information & Technology Ltd., Shanghai 200040, China

Received: 2009-03-02 Revised: 2009-04-22 Online: 2010-10-01 Published: 2010-10-01
Contact: YAN Cai-rong

Abstract

To address the limitations of traditional knowledge base representations, an extended tree-like knowledge base is built by decomposing and recombining domain knowledge. Each leaf node corresponds to a concrete knowledge instance, called atomic knowledge, while each non-leaf node corresponds only to a knowledge concept. A data cleaning algorithm based on this knowledge base is also proposed. Given the nodes selected by the user, it automatically extracts the atomic knowledge under them, analyzes it, eliminates duplicates, orders the remaining atomic knowledge into a sequence by processing weight, and then cleans the data by applying the sequence item by item. Experiments show that the algorithm effectively optimizes user requests, reduces the number of passes over the mass data, and markedly improves cleaning efficiency.

Key words: domain knowledge, knowledge base, data cleaning, mass data
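The steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `KnowledgeNode` class, the `extract_atoms` and `clean` functions, the rule signature, the name-based duplicate check, and the "lower weight runs first" ordering are all assumptions made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, str]


@dataclass
class KnowledgeNode:
    """Hypothetical node of the tree-like knowledge base."""
    name: str
    # Leaf nodes carry atomic knowledge: here, a rule applied to one record.
    rule: Optional[Callable[[Record], Record]] = None
    weight: int = 0
    # Non-leaf nodes only group children under a knowledge concept.
    children: List["KnowledgeNode"] = field(default_factory=list)

    def is_atomic(self) -> bool:
        return self.rule is not None


def extract_atoms(selected: Iterable[KnowledgeNode]) -> List[KnowledgeNode]:
    """Collect atomic knowledge under the selected nodes, drop duplicates,
    and order by weight (assumption: lower weight is applied first)."""
    seen: Dict[str, KnowledgeNode] = {}
    stack = list(selected)
    while stack:
        node = stack.pop()
        if node.is_atomic():
            seen.setdefault(node.name, node)  # assumption: same name = duplicate
        else:
            stack.extend(node.children)
    return sorted(seen.values(), key=lambda n: (n.weight, n.name))


def clean(records: Iterable[Record], atoms: List[KnowledgeNode]) -> List[Record]:
    """Traverse the data once, passing each record through the whole sequence."""
    cleaned = []
    for record in records:
        for atom in atoms:
            record = atom.rule(record)
        cleaned.append(record)
    return cleaned


# Two leaves; "upper_city" is reachable from both concept nodes.
trim = KnowledgeNode("trim_name", weight=1,
                     rule=lambda r: {**r, "name": r["name"].strip()})
upper = KnowledgeNode("upper_city", weight=2,
                      rule=lambda r: {**r, "city": r["city"].upper()})
text = KnowledgeNode("text_formatting", children=[trim, upper])
address = KnowledgeNode("address", children=[upper])

# Overlapping selections yield each atomic rule only once.
atoms = extract_atoms([text, address, trim])
print([a.name for a in atoms])  # ['trim_name', 'upper_city']
print(clean([{"name": " Li ", "city": "shanghai"}], atoms))
# [{'name': 'Li', 'city': 'SHANGHAI'}]
```

Under these assumptions, merging the selected rules into one sequence before touching the data means the data set is traversed once, rather than once per selected node, which is the effect the abstract attributes to the algorithm.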

CLC Number: TP311

YAN Cai-rong, SUN Gui-ning, GAO Nian-gao. Mass data cleaning algorithm based on extended tree-like knowledge base[J]. Computer Engineering and Applications, 2010, 46(28): 146-148.

