XML document clustering using frequent structure

Computer Engineering and Applications, 2008, Vol. 44, Issue (9): 135-138.

• Database, Signal and Information Processing •

FU Shan-shan, WU Yang-yang

Department of Computer Science, Huaqiao University, Quanzhou, Fujian 362021, China

Received: 2007-07-12 Revised: 2007-10-16 Online: 2008-03-21 Published: 2008-03-21
Contact: FU Shan-shan

Abstract

This paper studies XML document clustering based on frequent structures, namely frequent paths and frequent trees. It first presents SSTMiner, an efficient algorithm that mines all embedded frequent subtrees in XML documents. With minor modifications, SSTMiner yields two further algorithms, FrePathMiner and FreTreeMiner, which mine common (maximal) frequent paths and common (maximal) frequent trees, respectively. Using these paths and trees as the features of each XML document, an agglomerative hierarchical clustering algorithm called XMLCluster is then proposed. Experimental results show that both FrePathMiner and FreTreeMiner find more frequent structures than ASPMiner, which gives the clustering step more structural features and leads to higher clustering precision.

Key words: XML document clustering, common frequent path, common frequent tree, hierarchical clustering
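
The general idea in the abstract (describe each XML document by the frequent structures it contains, then cluster agglomeratively) can be illustrated with a small sketch. This is not SSTMiner, FrePathMiner, FreTreeMiner, or XMLCluster, whose details are not given on this page. As stated assumptions, the sketch uses plain root-to-leaf tag paths instead of embedded or maximal structures, a document-count support threshold, Jaccard similarity, and average-link merging. All function names and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations


def root_to_leaf_paths(xml_text):
    """Return the set of root-to-leaf tag paths of one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:  # leaf element: record the full tag path
            paths.add("/".join(path))
        for child in children:
            stack.append((child, path + (child.tag,)))
    return paths


def frequent_paths(docs_paths, min_support):
    """Keep the paths that occur in at least min_support documents."""
    counts = Counter(p for paths in docs_paths for p in paths)
    return {p for p, c in counts.items() if c >= min_support}


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure, not from the paper)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(features, k):
    """Average-link agglomerative clustering of feature sets into k clusters."""
    clusters = [[i] for i in range(len(features))]

    def sim(c1, c2):
        total = sum(jaccard(features[i], features[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the most similar pair of clusters; a < b always holds here.
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda ab: sim(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)


# Hypothetical sample documents: two "book" records and two "movie" records.
docs = [
    "<book><title/><author><name/></author></book>",
    "<book><title/><author><name/></author><year/></book>",
    "<movie><title/><director/></movie>",
    "<movie><title/><director/><year/></movie>",
]
all_paths = [root_to_leaf_paths(d) for d in docs]
frequent = frequent_paths(all_paths, min_support=2)
features = [paths & frequent for paths in all_paths]  # per-document features
result = cluster(features, k=2)
print(result)  # [[0, 1], [2, 3]]
```

In this sketch, paths that appear in only one document (such as book/year) fall below the support threshold and do not influence the clustering, which is the role frequent structures play as features.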

FU Shan-shan, WU Yang-yang. XML document clustering using frequent structure[J]. Computer Engineering and Applications, 2008, 44(9): 135-138.

