Parallelization and optimization of FP_Growth algorithm based on Spark

doi:10.3778/j.issn.1002-8331.1705-0114

Computer Engineering and Applications, 2018, Vol. 54, Issue 13: 52-58. DOI: 10.3778/j.issn.1002-8331.1705-0114

SHI Lukui¹, ZHANG Xin¹, SHI Shengli²

1. School of Computer Science and Software, Hebei University of Technology, Tianjin 300401, China
2. School of Information Technology, Hebei Normal University, Shijiazhuang 050024, China

Online: 2018-07-01 Published: 2018-07-17

Abstract

The PFP_Growth algorithm is a MapReduce-based parallelization of the FP_Growth algorithm on the Hadoop platform. It does not consider load balance when grouping the transaction set, so different nodes finish their tasks at different times, sometimes with large gaps, which lowers the algorithm's efficiency. To improve efficiency, this paper proposes a Spark-based RPFP algorithm that optimizes PFP_Growth in two respects: balanced grouping and reduced time complexity. Balanced grouping is achieved by placing items with a large load into the group whose total load is currently the smallest. Time complexity is reduced by adding a hash table to the head table structure so that element addresses can be accessed quickly. Experimental results show that RPFP, by optimizing the PFP algorithm, effectively improves the efficiency of frequent itemset mining.

Key words: FP_Growth algorithm, frequent itemset mining, load balance, head table, Spark
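
The two optimizations named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how an item's load is estimated or how the head table is laid out, so the `loads` mapping, the `balanced_groups` function, and the `HeaderTable` class are hypothetical names chosen for illustration only.

```python
import heapq
from typing import Dict, List, Tuple


def balanced_groups(loads: Dict[str, int], num_groups: int) -> List[List[str]]:
    """Greedy grouping: take items from heaviest to lightest and put each
    one into the group whose total load is currently the smallest."""
    # Heap entries are (total_load, group_index); the lightest group pops first.
    heap: List[Tuple[int, int]] = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    groups: List[List[str]] = [[] for _ in range(num_groups)]
    # Descending load; ties are broken by item name so the result is deterministic.
    for item, load in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, g = heapq.heappop(heap)
        groups[g].append(item)
        heapq.heappush(heap, (total + load, g))
    return groups


class HeaderTable:
    """Head table with an added hash index from item to its entry
    (an assumed layout, used here only to show the idea)."""

    def __init__(self, items_by_frequency: List[str]) -> None:
        # Each entry is [item, node_links]; list order follows frequency order.
        self.entries: List[list] = [[item, []] for item in items_by_frequency]
        # Hash index: item -> position in entries, so lookup needs no linear scan.
        self.index: Dict[str, int] = {
            item: i for i, item in enumerate(items_by_frequency)
        }

    def link(self, item: str, node: object) -> None:
        """Record a tree node under the entry for the given item."""
        self.entries[self.index[item]][1].append(node)

    def nodes(self, item: str) -> List[object]:
        """Return all tree nodes linked under the given item."""
        return self.entries[self.index[item]][1]
```

With the assumed loads `{"a": 9, "b": 7, "c": 6, "d": 5, "e": 3}` and two groups, `balanced_groups` returns `[["a", "d"], ["b", "c", "e"]]`, whose totals are 14 and 16. The hash index turns each head table lookup into an expected constant-time dictionary access instead of a scan over all entries.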

SHI Lukui, ZHANG Xin, SHI Shengli. Parallelization and optimization of FP_Growth algorithm based on Spark[J]. Computer Engineering and Applications, 2018, 54(13): 52-58.
