Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (2): 81-85.

• Big Data and Cloud Computing •

Research of XML document similarity based on edit graph

XU Peijuan (徐沛娟), QI Fuhui (齐福慧), LI Zhuo (李卓), WANG Limin (王利民)

  1. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  • Online: 2016-01-15 Published: 2016-01-28

Abstract: Many algorithms exist for measuring the similarity of XML documents, and methods based on edit distance (ED) form one of the most important classes. Among the published ED-based algorithms, the edit graph algorithm is taken as the starting point of this study because of its computational efficiency. The article first introduces the idea of the edit graph algorithm. Because its computation depends strongly on the order of sibling nodes at the same level, it cannot accurately and effectively compare the similarity of data-centric XML documents, whose data are unordered. To address this problem, a split edit graph algorithm is proposed that builds on the idea of the edit graph algorithm and incorporates the idea of path-based algorithms. Experimental results show that the split edit graph algorithm reduces the edit graph algorithm's dependence on sibling order, is better suited to comparing the similarity of data-centric XML documents, and yields more accurate and effective results.

Key words: eXtensible Markup Language (XML), XML similarity, edit graph, edit script, split, sub-path set
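
The sibling-order dependence described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's split edit graph algorithm, whose details the abstract does not give. As assumed stand-ins, `sequence_edit_distance` is an insert/delete edit distance over the preorder sequence of (tag, depth) pairs, representing an order-sensitive comparison, and `path_similarity` is a multiset Jaccard score over root-to-node tag paths, representing an order-insensitive, path-based comparison. All function names and the two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def preorder(root):
    """Return the document as a preorder list of (tag, depth) pairs."""
    out = []

    def walk(node, depth):
        out.append((node.tag, depth))
        for child in node:
            walk(child, depth + 1)

    walk(root, 0)
    return out


def sequence_edit_distance(a, b):
    """Insert/delete edit distance between two sequences, via LCS length."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    lcs = prev[-1]
    # Every element outside the common subsequence is deleted or inserted.
    return len(a) + len(b) - 2 * lcs


def path_multiset(root):
    """Count every root-to-node tag path; sibling order is ignored."""
    paths = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths[path] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def path_similarity(p, q):
    """Multiset Jaccard similarity between two path counters."""
    inter = sum((p & q).values())
    union = sum((p | q).values())
    return inter / union if union else 1.0


# Two hypothetical documents with the same content but reordered siblings.
doc_a = ET.fromstring("<book><title/><author><name/></author><price/></book>")
doc_b = ET.fromstring("<book><price/><author><name/></author><title/></book>")

ordered_distance = sequence_edit_distance(preorder(doc_a), preorder(doc_b))
unordered_similarity = path_similarity(path_multiset(doc_a), path_multiset(doc_b))
```

For these two documents the order-sensitive distance is 4, although only the sibling order differs, while the path-based similarity is 1.0. This is the kind of gap between ordered and unordered comparison that the abstract attributes to the edit graph algorithm.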