Level two sub sampling algorithm of mining large data sets

doi:10.3778/j.issn.1002-8331.2010.35.036

Computer Engineering and Applications, 2010, Vol. 46, Issue (35): 126-128. DOI: 10.3778/j.issn.1002-8331.2010.35.036

• Database, Signal and Information Processing •

WANG Yu-rong, QIAN Xue-zhong

School of Information Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China

Received: 2010-05-24 Revised: 2010-08-09 Online: 2010-12-11 Published: 2010-12-11
Contact: WANG Yu-rong

Abstract

As the data sets used in association rule mining keep growing, many sampling algorithms suffer from low accuracy and must solve a series of NP-hard problems. Building on an analysis of sampling with frequent 1-itemsets, this paper presents EHAC, a high-accuracy association rule mining algorithm based on evenly partitioning the frequent n-itemsets. Theory and experiments show that EHAC improves mining accuracy, keeps the frequent n-itemsets as evenly divided as possible while the data is divided evenly, reduces the number of database scans, and reduces the size of the database to a certain extent.

Key words: large data sets, association rules mining, sampling algorithm, EHAC algorithm, guide coefficient
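
The idea in the abstract of halving the data while keeping frequent items evenly divided can be illustrated with a small sketch. This is not the paper's EHAC algorithm, whose details are not given on this page. It is a simplified greedy bisection that balances frequent 1-items only, and every name in it (`frequent_items`, `bisect_balanced`, `hierarchical_sample`, `min_support`, `levels`) is a hypothetical choice made for illustration.

```python
from collections import Counter


def frequent_items(transactions, min_support):
    """Return the items whose relative support is at least min_support."""
    counts = Counter(item for t in transactions for item in set(t))
    threshold = min_support * len(transactions)
    return {item for item, c in counts.items() if c >= threshold}


def bisect_balanced(transactions, min_support):
    """Split transactions into two near-equal halves.

    Greedy heuristic (an assumption, not the published method): each
    transaction goes to the half that currently holds fewer of its
    frequent items, with ties going to the smaller half.
    """
    freq = frequent_items(transactions, min_support)
    halves = ([], [])
    counts = (Counter(), Counter())
    cap = (len(transactions) + 1) // 2  # maximum size of either half

    for t in transactions:
        items = set(t) & freq

        def cost(k):
            # How far half k is already ahead of the other half on these items.
            surplus = sum(counts[k][i] - counts[1 - k][i] for i in items)
            return (surplus, len(halves[k]))

        if len(halves[0]) >= cap:
            k = 1
        elif len(halves[1]) >= cap:
            k = 0
        else:
            k = min((0, 1), key=cost)
        halves[k].append(t)
        counts[k].update(items)
    return halves


def hierarchical_sample(transactions, min_support, levels):
    """Apply the bisection repeatedly, keeping the first half at each level."""
    sample = list(transactions)
    for _ in range(levels):
        sample = bisect_balanced(sample, min_support)[0]
    return sample


data = [["a", "b"], ["a", "b"], ["a", "c"], ["a", "c"],
        ["b"], ["b"], ["d"], ["a"]]
left, right = bisect_balanced(data, min_support=0.25)
```

Each bisection level halves the sample, so `levels` controls how far the database is reduced. A method that balances frequent n-itemsets, as the abstract describes, would need to track itemset counts rather than the single-item counts used here.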

CLC Number: TP311

WANG Yu-rong, QIAN Xue-zhong. Level two sub sampling algorithm of mining large data sets[J]. Computer Engineering and Applications, 2010, 46(35): 126-128.
