Computer Engineering and Applications ›› 2011, Vol. 47 ›› Issue (1): 144-146. DOI: 10.3778/j.issn.1002-8331.2011.01.039

• Database, Signal and Information Processing •

GML document structural clustering algorithm based on frequent subtree patterns

ZHU Yingwen1, JI Genlin2, SUN Qinhong1

  1. Department of Computer Foundation Teaching, Sanjiang University, Nanjing 210012, China
  2. School of Computer, Nanjing Normal University, Nanjing 210097, China
  • Received: 2009-08-24  Revised: 2009-10-24  Online: 2011-01-01  Published: 2011-01-01
  • Contact: ZHU Yingwen

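Original title (Chinese): 基于频繁子树模式的GML文档结构聚类算法

Authors (Chinese): 朱颖雯1, 吉根林2, 孙勤红1

  1. Department of Computer Foundation Teaching, Sanjiang University, Nanjing 210012
  2. School of Computer, Nanjing Normal University, Nanjing 210097

  • Corresponding author: 朱颖雯 (ZHU Yingwen)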

Abstract: This paper presents GCFS, an algorithm for clustering GML documents by structure based on frequent subtree patterns. It first mines all maximal and closed frequent induced subtrees from the GML documents, then selects some of these subtree patterns as clustering features, weights each feature according to the length of its subtree pattern, computes the similarity of two GML documents with the cosine function, and clusters the documents on these features with the K-Means algorithm. Experimental results show that GCFS is effective and efficient, and that it outperforms other GML clustering algorithms.

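Abstract (translated from the Chinese): This paper proposes GCFS (GML Clustering based on Frequent Subtree patterns), a structural clustering algorithm for GML documents based on frequent subtree patterns. Unlike other related algorithms, it first mines the maximal and closed frequent induced subtrees from a collection of GML documents and uses them as clustering features, assigns each feature a weight according to the size of the frequent subtree, defines similarity with the cosine function, and clusters on these features with the K-Means algorithm. Experimental results show that GCFS is effective, achieves high clustering efficiency, and outperforms other algorithms of its kind.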
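
The pipeline described in the abstract (weighted subtree features, cosine similarity, K-Means) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the frequent-subtree mining step has already been done, so each document is given simply as the set of pattern ids it contains. The pattern ids, the sizes in `pattern_sizes`, the choice of weight (the pattern's size, used directly), the binary containment test, and the seeding of centroids with the first k vectors are all assumptions made for illustration; the abstract does not specify them.

```python
import math

def feature_vector(doc_patterns, pattern_sizes):
    """Weighted vector: the pattern's size if the document contains it, else 0.
    Using the size itself as the weight is an assumption."""
    return [float(size) if pid in doc_patterns else 0.0
            for pid, size in sorted(pattern_sizes.items())]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans_cosine(vectors, k, max_iter=100):
    """K-Means loop that assigns each vector to its most cosine-similar centroid."""
    centroids = [list(v) for v in vectors[:k]]  # assumption: first k vectors as seeds
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical mined patterns (id -> number of nodes) and per-document pattern sets.
pattern_sizes = {"p1": 2, "p2": 3, "p3": 2, "p4": 4}
docs = [{"p1", "p2"}, {"p3", "p4"}, {"p1", "p2", "p3"}, {"p4"}]

vectors = [feature_vector(d, pattern_sizes) for d in docs]
labels = kmeans_cosine(vectors, k=2)
print(labels)
```

In this toy input, documents 0 and 2 share the patterns `p1` and `p2` and end up in one cluster, while documents 1 and 3 share `p4` and end up in the other. Weighting by size means larger shared subtrees pull documents together more strongly than small ones.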
