Computer Engineering and Applications, 2011, Vol. 47, Issue (1): 144-146. DOI: 10.3778/j.issn.1002-8331.2011.01.039

• Database, Signal and Information Processing •

GML document structural clustering algorithm based on frequent subtree patterns

ZHU Yingwen1, JI Genlin2, SUN Qinhong1

  1. Department of Computer Foundation Teaching, Sanjiang University, Nanjing 210012, China
  2. School of Computer, Nanjing Normal University, Nanjing 210097, China
  • Received: 2009-08-24  Revised: 2009-10-24  Online: 2011-01-01  Published: 2011-01-01
  • Contact: ZHU Yingwen

Abstract: This paper presents GCFS (GML Clustering based on Frequent Subtree patterns), an algorithm for clustering GML documents by structure. Unlike other related algorithms, GCFS first mines the maximal and closed frequent induced subtrees from the GML document collection and uses these patterns as clustering features. It weights each feature according to the size of the subtree pattern, measures the similarity of two GML documents with the cosine function, and clusters the documents with the K-Means algorithm. Experimental results show that GCFS is effective and efficient, and that it outperforms other GML clustering algorithms of the same kind.
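
The clustering stage described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact weighting formula, so the sketch assumes that a feature's weight equals the number of nodes in the subtree pattern. It also assumes the frequent subtrees have already been mined, so each document is represented simply as the set of pattern IDs it contains. The names `feature_vector`, `cosine`, `kmeans_cosine`, and the sample data are hypothetical.

```python
import math


def feature_vector(doc_patterns, pattern_sizes):
    """Build one weighted entry per pattern, ordered by pattern ID.

    Assumption: the weight is the pattern's size (node count) when the
    document contains the pattern, and 0 otherwise.
    """
    return [float(size) if pid in doc_patterns else 0.0
            for pid, size in sorted(pattern_sizes.items())]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def kmeans_cosine(vectors, k, iterations=20):
    """K-Means that assigns each vector to its most cosine-similar centroid.

    Assumption: the first k vectors seed the centroids, which keeps the
    sketch deterministic; a real run would typically seed randomly.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(iterations):
        new_labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical mined patterns: pattern ID -> subtree size (node count).
pattern_sizes = {"p1": 3, "p2": 2, "p3": 4, "p4": 2}
# Hypothetical documents, each given as the set of patterns it contains.
docs = [{"p1", "p2"}, {"p3", "p4"}, {"p1"}, {"p3"}]

vectors = [feature_vector(d, pattern_sizes) for d in docs]
labels = kmeans_cosine(vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

In this toy input, documents sharing `p1` fall into one cluster and documents sharing `p3` into the other. Weighting by size means that sharing a larger subtree contributes more to the similarity than sharing a smaller one.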

CLC number: