XML document clustering using frequent structure

Computer Engineering and Applications, 2008, Vol. 44, Issue 9: 135-138.

• Database, Signal and Information Processing •

FU Shan-shan, WU Yang-yang

Department of Computer Science, Huaqiao University, Quanzhou, Fujian 362021, China

Received: 2007-07-12 Revised: 2007-10-16 Online: 2008-03-21 Published: 2008-03-21
Corresponding author: FU Shan-shan

Abstract

Abstract: This paper studies XML document clustering based on frequent structures, namely frequent paths and frequent subtrees. It first presents SSTMiner, an efficient algorithm for mining all embedded frequent subtrees in XML documents. SSTMiner is then modified to obtain the FrePathMiner and FreTreeMiner algorithms, which mine the maximal frequent paths and the maximal frequent subtrees of XML documents, respectively. On this basis, an agglomerative hierarchical clustering algorithm called XMLCluster is proposed, which clusters XML documents using either the maximal frequent paths or the maximal frequent subtrees as document features. Experimental results show that both FrePathMiner and FreTreeMiner find more frequent structures than the traditional ASPMiner algorithm, which provides more structural features for clustering and thus yields higher clustering precision.

Key words: XML document clustering, maximal frequent path, maximal frequent subtree, hierarchical clustering
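
The abstract does not give the details of FrePathMiner or XMLCluster, so the following Python sketch only illustrates the general idea of clustering XML documents by maximal frequent paths. It is not the authors' algorithm. Everything in it is an assumption made for illustration: paths are taken to be root-to-node tag paths, a path is frequent if it occurs in at least `min_support` documents, a frequent path is maximal if it is not a proper prefix of another frequent path, and documents are merged by average-linkage agglomerative clustering on Jaccard similarity until the best similarity falls below `threshold`. All function and parameter names are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def root_paths(xml_text):
    """Return the set of root-to-node tag paths of one XML document."""
    paths = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return paths


def maximal_frequent_paths(docs_paths, min_support):
    """Paths found in at least min_support documents that are not a
    proper prefix of another frequent path (assumed notion of maximal)."""
    counts = {}
    for paths in docs_paths:
        for p in paths:
            counts[p] = counts.get(p, 0) + 1
    frequent = {p for p, c in counts.items() if c >= min_support}
    return {p for p in frequent
            if not any(q != p and q[:len(p)] == p for q in frequent)}


def jaccard(a, b):
    """Jaccard similarity of two feature sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(features, threshold):
    """Average-linkage agglomerative clustering of document indices.
    Stops when the most similar pair of clusters is below threshold."""
    clusters = [[i] for i in range(len(features))]

    def sim(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(jaccard(features[i], features[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        (a, b), best = max(
            (((x, y), sim(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1])
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters


# Toy documents (made up for this sketch).
docs = [
    "<book><title/><author><name/></author></book>",
    "<book><title/><author><name/></author><year/></book>",
    "<cd><artist/><track><length/></track></cd>",
    "<cd><artist/><track><length/></track></cd>",
]
docs_paths = [root_paths(d) for d in docs]
mfp = maximal_frequent_paths(docs_paths, min_support=2)
features = [paths & mfp for paths in docs_paths]
clusters = cluster(features, threshold=0.5)
print(clusters)
```

On the toy input, the two `book` documents share the same maximal frequent paths and so do the two `cd` documents, so they end up in separate clusters. The quadratic pair search keeps the sketch short; a real implementation would likely cache pairwise similarities.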

FU Shan-shan, WU Yang-yang. XML document clustering using frequent structure[J]. Computer Engineering and Applications, 2008, 44(9): 135-138.

