Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (21): 72-78. DOI: 10.3778/j.issn.1002-8331.1912-0476


Efficient Frequent Set Mining Algorithm for Adaptive Data Sets on SparkSql

WANG Yonggui, GUO Xintong   

  1. School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
  • Online: 2020-11-01 Published: 2020-11-03





Existing Spark-based association rule algorithms suffer from high I/O overhead, reliance on a single data structure for mining frequent sets, and inefficient support counting. To address these problems, this paper proposes a distributed frequent set mining algorithm based on SparkSql. The data set is loaded into a DataFrame, and an improved Bloom filter is used to store the item sets generated during mining efficiently, which eases the memory and computation-speed limitations of RDDs. Transactions, items and item sets are pruned based on the Apriori property, and the support of an item set is computed by intersecting the transaction sets corresponding to its items, which makes support counting more efficient. Two iteration methods and a condition for selecting between them are proposed so that the algorithm generalizes to a variety of data sets. Experiments show that in each iteration the algorithm adapts to the characteristics of the data and chooses the optimal iteration method. The algorithm also parallelizes well and scales to larger clusters and larger data sets. Compared with the Spark-based association rule algorithms YAFIM and R-Apriori, it achieves better performance both in each iteration and in overall running time.
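The support computation described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' SparkSql implementation: it is plain Python with no DataFrame or Bloom filter, and the names `build_tidsets`, `support` and `frequent_itemsets` are hypothetical. It only shows the two ideas the abstract names, assuming a standard vertical layout: the support of an item set is the size of the intersection of its items' transaction-ID sets, and candidates are pruned with the Apriori property.

```python
from itertools import combinations


def build_tidsets(transactions):
    """Map each item to the set of transaction ids that contain it."""
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def support(itemset, tidsets):
    """Support count = size of the intersection of the items' tid-sets."""
    sets = [tidsets.get(item, set()) for item in itemset]
    return len(set.intersection(*sets)) if sets else 0


def frequent_itemsets(transactions, min_support):
    """Level-wise mining; returns {frozenset(itemset): support count}."""
    tidsets = build_tidsets(transactions)
    # Level 1: infrequent single items are dropped up front.
    current = [frozenset([i]) for i in sorted(tidsets)
               if len(tidsets[i]) >= min_support]
    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset, tidsets)
        k += 1
        frequent = set(current)
        # Join frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Apriori property: every (k-1)-subset of a frequent set is frequent,
        # so a candidate with an infrequent subset is discarded uncounted.
        current = [c for c in candidates
                   if all(frozenset(sub) in frequent
                          for sub in combinations(c, k - 1))
                   and support(c, tidsets) >= min_support]
    return result
```

Calling `frequent_itemsets(transactions, min_support)` with a list of item sets and an absolute support threshold returns every frequent item set with its support count. In a distributed setting the transaction-ID sets would be partitioned across workers, which this sketch does not attempt.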

Key words: frequent sets, big data, candidate set, adaptive data, Bloom filter, SparkSql


