Computer Engineering and Applications (计算机工程与应用) ›› 2008, Vol. 44 ›› Issue (35): 5-8. DOI: 10.3778/j.issn.1002-8331.2008.35.002

• Doctoral Forum •

Algorithm for clustering frequent itemsets based on generators

LI Jin-hong (李晋宏)1,2, YANG Bing-ru (杨炳儒)1, SONG Wei (宋威)2, HOU Wei (侯伟)1

  1. School of Information Engineering, University of Science and Technology Beijing, Beijing 100083, China
  2. College of Information Engineering, North China University of Technology, Beijing 100144, China
  • Received: 2008-09-12 Revised: 2008-10-06 Online: 2008-12-11 Published: 2008-12-11
  • Contact: LI Jin-hong

Abstract: Effectively reducing the number of frequent itemsets is an active topic in data mining research, and clustering frequent itemsets is one approach to the problem. Since generators are a lossless concise representation of all frequent itemsets, clustering the generators has the same effect as clustering all frequent itemsets. This paper proposes an algorithm for clustering frequent itemsets based on generators. First, the rationale for choosing generators as the objects to cluster is discussed using the minimum description length principle. Second, pruning strategies and a mining algorithm for generators are given. Finally, a clustering algorithm for generators is presented, built on a new similarity measure for itemsets. Experimental results show that the proposed method effectively reduces the number of itemsets and achieves high mining efficiency.

Key words: data mining, generator, clustering
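
To make the notion of a generator concrete, the following is a minimal sketch, not the paper's mining algorithm. It assumes the commonly used definition (an itemset is a generator if every proper subset has strictly larger support) and enumerates candidates by brute force, without the pruning strategies the abstract refers to. The toy transaction database, the threshold, and the names `support` and `frequent_generators` are illustrative assumptions.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_generators(transactions, min_support):
    """Brute-force enumeration of non-empty frequent generators.

    Assumed definition: an itemset is a generator when no proper subset
    has the same support. Because support never increases as items are
    added, it is enough to compare against the subsets obtained by
    removing a single item.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            sup = support(candidate, transactions)
            if sup < min_support:
                continue  # infrequent itemsets are not reported
            if all(support(candidate - {i}, transactions) > sup
                   for i in candidate):
                result[candidate] = sup
    return result


# Hypothetical toy database of five transactions.
transactions = [
    frozenset("acd"),
    frozenset("bce"),
    frozenset("abce"),
    frozenset("be"),
    frozenset("abce"),
]

gens = frequent_generators(transactions, min_support=2)
for itemset, sup in sorted(gens.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), sup)
```

On this toy database with a minimum support of 2, 15 itemsets are frequent but only 8 of them are generators, which illustrates why clustering generators works on a smaller set of objects than clustering all frequent itemsets.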