Computer Engineering and Applications, 2010, Vol. 46, Issue 35: 126-128. DOI: 10.3778/j.issn.1002-8331.2010.35.036

• Database, Signal and Information Processing •

Level two sub sampling algorithm of mining large data sets

WANG Yu-rong, QIAN Xue-zhong

  1. School of Information, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Received: 2010-05-24 Revised: 2010-08-09 Online: 2010-12-11 Published: 2010-12-11
  • Contact: WANG Yu-rong

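Hierarchical bisection sampling algorithm for mining large data sets

王玉荣 (WANG Yu-rong), 钱雪忠 (QIAN Xue-zhong)

  1. School of Information Engineering, Jiangnan University, Wuxi, Jiangsu 214122
  • Corresponding author: 王玉荣 (WANG Yu-rong)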

Abstract: The data sets used in association rule mining keep growing, yet many sampling algorithms have low accuracy and must also solve a series of NP-hard problems. Building on sampling with frequent 1-itemsets, this paper presents the EHAC algorithm, an association rule mining algorithm based on the even partitioning of frequent n-itemsets. Theory and experiments show that EHAC improves the accuracy of data mining, keeps the frequent n-itemsets evenly divided as far as possible while the data is divided evenly, reduces the number of database scans, and reduces the size of the database to a certain extent.

Key words: large data sets, association rules mining, sampling algorithm, EHAC algorithm, guide coefficient

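Abstract: Data sets for association rule mining are growing steadily, while many sampling algorithms offer low accuracy and also require solving a series of NP-hard problems. Based on an analysis of sampling with frequent 1-itemsets, a high-accuracy association rule mining algorithm built on the even partitioning of frequent n-itemsets, the EHAC algorithm, is proposed. Both theory and experiments show that EHAC improves mining accuracy; while dividing the data evenly, it keeps the frequent n-itemsets as evenly divided as possible, reduces the number of database scans, and shrinks the database to some extent.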

Key words: large data sets, association rule mining, sampling algorithm, EHAC algorithm, criterion coefficient
