Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (6): 118-127. DOI: 10.3778/j.issn.1002-8331.2009-0316

• Pattern Recognition and Artificial Intelligence •

Random Forest Algorithm Based on PCA and Hierarchical Selection Under Spark

LEI Chen, MAO Yimin   

  1. School of Information Engineering, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China
  • Online: 2022-03-15  Published: 2022-03-15

Abstract: In the context of big data, the random forest algorithm suffers from a large covariance matrix during feature transformation, insufficient coverage of feature information by the subspaces, and high node communication overhead. To address these problems, a parallel random forest algorithm based on PCA and hierarchical subspace selection, PLA-PRF (PCA and subspace layer sampling on parallel random forest algorithm), is proposed. First, for the initial feature set, a PCA-based matrix factorization strategy (MFS) is proposed to compress the original features and extract the principal component features, which resolves the problem of the large covariance matrix arising during feature transformation. Then, based on the principal component features, an error-constrained hierarchical subspace construction algorithm (EHSCA) is proposed, which selects informative features layer by layer to build the feature subspaces and thus solves the problem of insufficient coverage of subspace feature information. Finally, in the parallel training of the decision trees under Spark, a data reuse strategy (DRS) is designed: the RDD data are partitioned vertically and combined with an index table so that features can be reused, which reduces the node communication overhead. Experimental results show that the PLA-PRF algorithm achieves better classification performance and higher parallelization efficiency.
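
The abstract names the two feature-side components (MFS and EHSCA) without giving their procedures. As a rough illustration of the general idea, the NumPy sketch below shows one way a PCA-based compression step and a layered subspace draw could fit together; the SVD-based projection, the explained-variance layering, the 95% variance threshold, and the names pca_compress and hierarchical_subspace are all illustrative assumptions, not the paper's MFS or EHSCA algorithms.

```python
import numpy as np

def pca_compress(X, variance_ratio=0.95):
    """Project X onto its leading principal components.

    An SVD on the centered data avoids forming the full covariance
    matrix explicitly, which is the kind of cost the paper's MFS step
    targets (the exact factorization used there may differ).
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(np.cumsum(explained), variance_ratio)) + 1
    return Xc @ Vt[:k].T, explained[:k]

def hierarchical_subspace(explained, subspace_size, n_layers=3, rng=None):
    """Draw a feature subspace by sampling from importance layers.

    Features are ranked by explained variance, split into layers, and
    every layer contributes some features, so lower-ranked but still
    informative features are not starved -- a stand-in for the
    error-constrained layer-wise selection of EHSCA.
    """
    rng = np.random.default_rng(rng)
    order = np.argsort(explained)[::-1]
    layers = np.array_split(order, n_layers)
    per_layer = max(1, subspace_size // n_layers)
    picked = [rng.choice(layer, size=min(per_layer, len(layer)), replace=False)
              for layer in layers]
    return np.concatenate(picked)

# toy usage: compress 200x50 data, then draw a 12-feature subspace for one tree
X = np.random.rand(200, 50)
Z, ev = pca_compress(X)
subspace = hierarchical_subspace(ev, subspace_size=12)
print(Z.shape, sorted(subspace.tolist()))
```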

Key words: random forest, Spark, principal component analysis (PCA), layer sampling, error constraint, data partition, data reuse
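
The data reuse strategy (DRS) is described only at the level of "vertical partitioning of the RDD data combined with an index table". The PySpark sketch below illustrates that idea under assumed details: the block size, the (block, offset) layout of the index table, and the helper gather_subspace are illustrative choices, not the paper's implementation, and a real system would train the trees on the executors rather than collecting blocks to the driver.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[*]", "drs-sketch")

Z = np.random.rand(200, 12)   # stand-in for the PCA-compressed feature matrix
block_size = 4                # features per vertical block (assumed)
n_blocks = Z.shape[1] // block_size

# Vertical partition: (block_id, columns of that block), cached once so
# every tree reuses the same column blocks instead of re-shipping the data.
blocks = sc.parallelize(
    [(b, Z[:, b * block_size:(b + 1) * block_size]) for b in range(n_blocks)]
).cache()

# Index table: feature index -> (block_id, offset inside the block),
# broadcast so executors building trees could consult it cheaply.
index_table = sc.broadcast(
    {f: (f // block_size, f % block_size) for f in range(Z.shape[1])}
)

def gather_subspace(feature_ids):
    """Fetch only the column blocks that a tree's feature subspace touches."""
    needed = {index_table.value[f][0] for f in feature_ids}
    fetched = dict(blocks.filter(lambda kv: kv[0] in needed).collect())
    cols = [fetched[index_table.value[f][0]][:, index_table.value[f][1]]
            for f in feature_ids]
    return np.stack(cols, axis=1)

# e.g. one tree draws features {1, 5, 10}; only blocks 0, 1 and 2 move
sub = gather_subspace([1, 5, 10])
print(sub.shape)   # (200, 3)
sc.stop()
```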