Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (21): 72-78.DOI: 10.3778/j.issn.1002-8331.1912-0476


Efficient Frequent Itemset Mining Algorithm Adaptive to Data Sets on SparkSql

WANG Yonggui, GUO Xintong   

  1. School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
  • Online: 2020-11-01  Published: 2020-11-03


Abstract:

Association rule algorithms built on the Spark framework suffer from high I/O overhead, a single data structure and a single way of mining frequent itemsets, and inefficient support counting. To address these problems, this paper proposes a distributed algorithm programmed on SparkSql. The data set is loaded into a DataFrame, and an improved Bloom filter is used to store the itemsets generated during mining efficiently, overcoming the memory-resource and computation-speed limitations of RDDs. Transactions, items, and itemsets are pruned according to the a priori theorem, and the support of an itemset is computed with SQL statements that intersect the transaction sets corresponding to its items, improving the efficiency of support counting. Two iterative methods and an adaptive selection condition are proposed to strengthen the algorithm's generalization across data sets. Multiple experiments show that the algorithm always adapts to the characteristics of the data in the current iteration and chooses the optimal iterative method; it also exhibits high parallel performance and can scale to larger clusters and data sets. Compared with YAFIM and R-Apriori, two association rule algorithms based on the Spark framework, the proposed algorithm performs better in every iteration and in overall running efficiency.
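The two core ideas in the abstract can be sketched in plain Python (the paper itself uses SparkSql DataFrames, which are not reproduced here): a Bloom filter to compactly store generated itemsets, and support counting by intersecting the transaction-ID sets of an itemset's items. All names and parameters below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch, NOT the paper's Spark code: (1) a minimal Bloom filter
# for storing generated itemsets, (2) support counting via intersection of
# the transaction-ID sets of an itemset's items (a vertical data layout).
import hashlib

class BloomFilter:
    """Minimal Bloom filter: num_hashes probes into a boolean bit array."""
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _probes(self, item):
        # Derive k independent positions by salting the hash input.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._probes(item):
            self.bits[p] = True

    def __contains__(self, item):
        # May yield a false positive, never a false negative.
        return all(self.bits[p] for p in self._probes(item))

# Hypothetical toy data: transaction ID -> items it contains.
transactions = {
    0: {"a", "b", "c"},
    1: {"a", "b"},
    2: {"a", "c"},
    3: {"b", "c"},
}

# Invert to the vertical layout: item -> set of transaction IDs.
tid_sets = {}
for tid, items in transactions.items():
    for item in items:
        tid_sets.setdefault(item, set()).add(tid)

def support(itemset):
    """Support = size of the intersection of the items' TID sets."""
    tids = set.intersection(*(tid_sets[i] for i in itemset))
    return len(tids)

# Store generated candidate itemsets in the Bloom filter.
seen = BloomFilter()
for itemset in (("a", "b"), ("a", "c"), ("b", "c")):
    seen.add(frozenset(itemset))

print(support(("a", "b")))            # {a}: {0,1,2} ∩ {b}: {0,1,3} -> 2
print(frozenset(("a", "b")) in seen)  # True
```

In the paper this intersection is expressed as an SQL query over a DataFrame rather than Python set operations, which lets Spark distribute the support computation across the cluster.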

Key words: frequent itemsets, big data, candidate set, adaptive data, Bloom filter, SparkSql
