Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (21): 72-78.DOI: 10.3778/j.issn.1002-8331.1912-0476


Efficient Frequent Itemset Mining Algorithm Adaptive to Data Sets on SparkSql

WANG Yonggui, GUO Xintong   

  1. School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
  • Online: 2020-11-01  Published: 2020-11-03


Abstract:

Association rule algorithms built on the Spark framework suffer from high I/O overhead, a single data structure and a single way of mining frequent itemsets, and inefficient support counting. To address these problems, this paper proposes a distributed algorithm programmed on SparkSql. The data set is loaded into a DataFrame, and an improved Bloom filter is used to store the itemsets generated during mining efficiently, overcoming the memory-resource and computation-speed limitations of RDDs. Transactions, items, and itemsets are pruned according to the a priori theorem, and the support of an itemset is computed with SQL statements that intersect the transaction sets corresponding to its items, improving the efficiency of support counting. Two iterative methods and an adaptive selection condition are proposed to strengthen the algorithm's generalization across data sets. Multiple experiments show that the algorithm always adapts to the characteristics of the data in the current iteration and chooses the optimal iterative method; it also exhibits high parallel performance and can scale to larger clusters and data sets. Compared with YAFIM and R-Apriori, two association rule algorithms based on the Spark framework, the proposed algorithm performs better in every iteration and in overall running efficiency.
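The two core ideas in the abstract can be sketched in plain Python (the paper itself uses SparkSql DataFrames, which are not reproduced here): a Bloom filter to compactly store generated itemsets, and support counting by intersecting the transaction-ID sets of an itemset's items. All names and parameters below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch, NOT the paper's Spark code: (1) a minimal Bloom filter
# for storing generated itemsets, (2) support counting via intersection of
# the transaction-ID sets of an itemset's items (a vertical data layout).
import hashlib

class BloomFilter:
    """Minimal Bloom filter: num_hashes probes into a boolean bit array."""
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _probes(self, item):
        # Derive k independent positions by salting the hash input.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._probes(item):
            self.bits[p] = True

    def __contains__(self, item):
        # May yield a false positive, never a false negative.
        return all(self.bits[p] for p in self._probes(item))

# Hypothetical toy data: transaction ID -> items it contains.
transactions = {
    0: {"a", "b", "c"},
    1: {"a", "b"},
    2: {"a", "c"},
    3: {"b", "c"},
}

# Invert to the vertical layout: item -> set of transaction IDs.
tid_sets = {}
for tid, items in transactions.items():
    for item in items:
        tid_sets.setdefault(item, set()).add(tid)

def support(itemset):
    """Support = size of the intersection of the items' TID sets."""
    tids = set.intersection(*(tid_sets[i] for i in itemset))
    return len(tids)

# Store generated candidate itemsets in the Bloom filter.
seen = BloomFilter()
for itemset in (("a", "b"), ("a", "c"), ("b", "c")):
    seen.add(frozenset(itemset))

print(support(("a", "b")))            # {a}: {0,1,2} ∩ {b}: {0,1,3} -> 2
print(frozenset(("a", "b")) in seen)  # True
```

In the paper this intersection is expressed as an SQL query over a DataFrame rather than Python set operations, which lets Spark distribute the support computation across the cluster.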

Key words: frequent itemsets, big data, candidate set, adaptive data, Bloom filter, SparkSql
