Computer Engineering and Applications ›› 2010, Vol. 46 ›› Issue (35): 126-128. DOI: 10.3778/j.issn.1002-8331.2010.35.036

• Database, Signal and Information Processing •

Hierarchical bisection sampling algorithm for mining large data sets

WANG Yu-rong, QIAN Xue-zhong

  1. School of Information Engineering, Jiangnan University, Wuxi, Jiangsu 214122
  • Received: 2010-05-24 Revised: 2010-08-09 Online: 2010-12-11 Published: 2010-12-11
  • Corresponding author: WANG Yu-rong

Level two sub sampling algorithm of mining large data sets

WANG Yu-rong, QIAN Xue-zhong

  1. School of Information, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Received: 2010-05-24 Revised: 2010-08-09 Online: 2010-12-11 Published: 2010-12-11
  • Contact: WANG Yu-rong

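Abstract: Association rule mining faces ever-larger data sets, while many sampling algorithms have low accuracy and must also solve a series of NP-hard problems. Based on an analysis of sampling with frequent 1-itemsets, this paper proposes EHAC, a high-accuracy association rule mining algorithm based on the even partitioning of frequent n-itemsets. Both theory and experiments show that EHAC improves mining accuracy: while partitioning the data evenly, it keeps the frequent n-itemsets as evenly partitioned as possible, reduces the number of database scans, and shrinks the database to some extent.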

Keywords: large data sets, association rule mining, sampling algorithm, EHAC algorithm, guide coefficient

Abstract: As the data sets used in association rule mining keep growing, many sampling algorithms suffer from low accuracy and must also solve a series of NP-hard problems. Building on an analysis of sampling with frequent 1-itemsets, the EHAC algorithm is presented: a high-accuracy association rule mining algorithm based on the even partitioning of frequent n-itemsets. Theory and experiments show that EHAC improves the accuracy of data mining. While dividing the data evenly, it keeps the frequent n-itemsets as evenly divided as possible, reduces the number of database scans, and reduces the size of the database to a certain extent.

Key words: large data sets, association rules mining, sampling algorithm, EHAC algorithm, guide coefficient
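
The abstract does not give the steps of EHAC, so the following is only a minimal illustrative sketch of the general idea it describes: splitting a transaction set into two equal halves while keeping the counts of n-itemsets balanced between them. The function name `halve_balanced`, its parameters, and the greedy assignment rule are all assumptions made for illustration. They are not taken from the paper and should not be read as the EHAC algorithm itself.

```python
from collections import Counter
from itertools import combinations


def halve_balanced(transactions, n=1, frequent=None):
    """Greedily split transactions into two halves (hypothetical helper).

    Each transaction goes to whichever half currently lags behind in the
    transaction count plus the counts of the n-itemsets it contains.
    `frequent` optionally restricts balancing to a given set of itemsets
    (tuples of sorted items); if None, every n-itemset is balanced.
    """
    halves = ([], [])
    counts = (Counter(), Counter())
    for t in transactions:
        itemsets = [s for s in combinations(sorted(set(t)), n)
                    if frequent is None or s in frequent]
        # How far half 0 is ahead of half 1: size gap plus itemset-count gaps.
        lead = len(halves[0]) - len(halves[1])
        lead += sum(counts[0][s] - counts[1][s] for s in itemsets)
        target = 1 if lead > 0 else 0  # ties go to half 0
        halves[target].append(t)
        counts[target].update(itemsets)
    return halves, counts
```

Called on a list of transactions (each a list of items), the function returns the two halves together with their itemset counters, so the balance can be checked by comparing `counts[0]` with `counts[1]`. Applying it again to one half would give a hierarchy of successively smaller samples, which is the spirit of hierarchical bisection, though the paper's own selection criterion may differ.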
