Computer Engineering and Applications ›› 2008, Vol. 44 ›› Issue (9): 135-138.

• Database, Signal and Information Processing •

XML document clustering using frequent structure

FU Shan-shan, WU Yang-yang

  1. Department of Computer Science, Huaqiao University, Quanzhou, Fujian 362021, China
  • Received: 2007-07-12 Revised: 2007-10-16 Online: 2008-03-21 Published: 2008-03-21
  • Contact: FU Shan-shan

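XML Document Clustering Based on Frequent Structures

FU Shan-shan (傅珊珊), WU Yang-yang (吴扬扬)

  1. Department of Computer Science, Huaqiao University, Quanzhou, Fujian 362021
  • Corresponding author: FU Shan-shan (傅珊珊)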

Abstract: This paper studies XML document clustering based on frequent structures, namely frequent paths and frequent trees. It first presents an efficient algorithm, SSTMiner, which mines all embedded frequent trees in XML documents. With slight modifications, SSTMiner yields the FrePathMiner and FreTreeMiner algorithms, which mine common (maximal) frequent paths and common (maximal) frequent trees, respectively. Using common frequent paths and common frequent trees as the features that characterize XML documents, the paper then proposes an agglomerative hierarchical clustering algorithm, XMLCluster, to cluster the documents. The experimental results show that both FrePathMiner and FreTreeMiner find more frequent structures than ASPMiner, so they provide more features for clustering and achieve higher clustering precision.

Key words: XML document clustering, common frequent path, common frequent trees, hierarchical clustering
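
To make the notion of a frequent path concrete, the following is a minimal Python sketch. It is not SSTMiner or FrePathMiner, whose details are not given in the abstract. It assumes a simplified setting in which a path is a root-anchored sequence of element tags, support is the number of documents containing the path, and a frequent path is maximal when it is not a proper prefix of another frequent path. The function names `root_paths`, `frequent_paths`, and `maximal_paths`, and the sample documents, are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def root_paths(xml_text):
    """Return the set of root-anchored tag paths that occur in one document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + (child.tag,)))
    return paths


def frequent_paths(docs, min_support):
    """Return the paths that occur in at least min_support documents."""
    counts = Counter()
    for doc in docs:
        # Each document contributes at most 1 to the support of a path.
        counts.update(root_paths(doc))
    return {path for path, count in counts.items() if count >= min_support}


def maximal_paths(paths):
    """Keep only the paths that are not a proper prefix of another path."""
    return {
        p for p in paths
        if not any(len(q) > len(p) and q[:len(p)] == p for q in paths)
    }


# Assumed toy collection: two "book" documents and one "article" document.
docs = [
    "<book><title/><author><name/></author></book>",
    "<book><title/><author><name/></author><year/></book>",
    "<article><title/><journal/></article>",
]
frequent = frequent_paths(docs, min_support=2)
maximal = maximal_paths(frequent)
```

In this toy collection only the `book` paths reach a support of 2, and the maximal ones are `book/title` and `book/author/name`. Embedded frequent trees, as mined by SSTMiner, generalize this idea from paths to subtrees and are not covered by the sketch.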

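Abstract (translated from Chinese): This paper studies methods for clustering XML documents based on frequent structures, where the frequent structures comprise frequent paths and frequent subtrees. It first introduces SSTMiner, an algorithm that mines all embedded frequent subtrees in XML documents. Modifying SSTMiner yields the FrePathMiner and FreTreeMiner algorithms, which mine the maximal frequent paths and the maximal frequent subtrees in XML documents, respectively. On this basis, an agglomerative hierarchical clustering algorithm, XMLCluster, is proposed; it clusters the documents using either maximal frequent paths or maximal frequent subtrees as the features of an XML document. The experimental results show that FrePathMiner and FreTreeMiner both find more frequent structures than the traditional ASPMiner algorithm, which provides more structural features for document clustering and thus yields higher clustering precision.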

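Key words (translated from Chinese): XML document clustering, maximal frequent path, maximal frequent subtree, hierarchical clustering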
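
The agglomerative clustering step described in the abstract can be illustrated with a short Python sketch. This is not the XMLCluster algorithm itself, whose similarity measure and stopping rule are not stated here. The sketch assumes that each document is represented by a set of frequent-structure features, that documents are compared by Jaccard similarity, that clusters are compared by average linkage, and that merging stops once no pair of clusters reaches a similarity threshold. All names (`jaccard`, `cluster_similarity`, `agglomerative`, `min_similarity`) and the feature sets are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_similarity(c1, c2, features):
    """Average-linkage similarity between two clusters of document indices."""
    total = sum(jaccard(features[i], features[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(features, min_similarity):
    """Merge the most similar clusters until none reaches min_similarity."""
    clusters = [[i] for i in range(len(features))]  # one cluster per document
    while len(clusters) > 1:
        best_pair, best_sim = None, min_similarity
        for a, b in combinations(range(len(clusters)), 2):
            sim = cluster_similarity(clusters[a], clusters[b], features)
            if sim >= best_sim:
                best_pair, best_sim = (a, b), sim
        if best_pair is None:
            break  # no remaining pair is similar enough to merge
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


# Assumed feature sets: frequent paths found in four toy documents.
features = [
    {"book/title", "book/author/name"},
    {"book/title", "book/author/name", "book/year"},
    {"article/title", "article/journal"},
    {"article/title", "article/journal", "article/volume"},
]
result = agglomerative(features, min_similarity=0.5)
```

With these feature sets the two `book` documents end up in one cluster and the two `article` documents in another, because documents from different groups share no features. A richer feature set, which is what the abstract credits FrePathMiner and FreTreeMiner with providing, gives the similarity measure more evidence to separate such groups.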