Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (6): 118-127. DOI: 10.3778/j.issn.1002-8331.2009-0316

• Pattern Recognition and Artificial Intelligence •

Random Forest Algorithm Based on PCA and Hierarchical Selection Under Spark

LEI Chen, MAO Yimin   

  1. School of Information Engineering, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China
  • Online: 2022-03-15  Published: 2022-03-15

Abstract: In the context of big data, the random forest algorithm suffers from a large covariance matrix during feature transformation, insufficient coverage of feature information by the subspaces, and high node communication overhead. To address these problems, a parallel random forest algorithm based on PCA and hierarchical subspace selection, PLA-PRF (PCA and subspace layer sampling on parallel random forest algorithm), is proposed. First, for the initial feature set, a PCA-based matrix factorization strategy (MFS) is proposed to compress the original features and extract the principal component features, which resolves the problem of the large covariance matrix arising during feature transformation. Then, based on the principal component features, an error-constrained hierarchical subspace construction algorithm (EHSCA) is proposed, which selects informative features layer by layer to build the feature subspaces and thus solves the problem of insufficient coverage of subspace feature information. Finally, in the parallel training of the decision trees under Spark, a data reuse strategy (DRS) is designed: the RDD data are partitioned vertically and combined with an index table so that features can be reused, which reduces the node communication overhead. Experimental results show that the PLA-PRF algorithm achieves better classification performance and higher parallelization efficiency.
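
The abstract names the two feature-side components (MFS and EHSCA) without giving their procedures. As a rough illustration of the general idea, the NumPy sketch below shows one way a PCA-based compression step and a layered subspace draw could fit together; the SVD-based projection, the explained-variance layering, the 95% variance threshold, and the names pca_compress and hierarchical_subspace are all illustrative assumptions, not the paper's MFS or EHSCA algorithms.

```python
import numpy as np

def pca_compress(X, variance_ratio=0.95):
    """Project X onto its leading principal components.

    An SVD on the centered data avoids forming the full covariance
    matrix explicitly, which is the kind of cost the paper's MFS step
    targets (the exact factorization used there may differ).
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(np.cumsum(explained), variance_ratio)) + 1
    return Xc @ Vt[:k].T, explained[:k]

def hierarchical_subspace(explained, subspace_size, n_layers=3, rng=None):
    """Draw a feature subspace by sampling from importance layers.

    Features are ranked by explained variance, split into layers, and
    every layer contributes some features, so lower-ranked but still
    informative features are not starved -- a stand-in for the
    error-constrained layer-wise selection of EHSCA.
    """
    rng = np.random.default_rng(rng)
    order = np.argsort(explained)[::-1]
    layers = np.array_split(order, n_layers)
    per_layer = max(1, subspace_size // n_layers)
    picked = [rng.choice(layer, size=min(per_layer, len(layer)), replace=False)
              for layer in layers]
    return np.concatenate(picked)

# toy usage: compress 200x50 data, then draw a 12-feature subspace for one tree
X = np.random.rand(200, 50)
Z, ev = pca_compress(X)
subspace = hierarchical_subspace(ev, subspace_size=12)
print(Z.shape, sorted(subspace.tolist()))
```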

Key words: random forest, Spark, principal component analysis (PCA), layer sampling, error constraint, data partition, data reuse
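
The data reuse strategy (DRS) is described only at the level of "vertical partitioning of the RDD data combined with an index table". The PySpark sketch below illustrates that idea under assumed details: the block size, the (block, offset) layout of the index table, and the helper gather_subspace are illustrative choices, not the paper's implementation, and a real system would train the trees on the executors rather than collecting blocks to the driver.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[*]", "drs-sketch")

Z = np.random.rand(200, 12)   # stand-in for the PCA-compressed feature matrix
block_size = 4                # features per vertical block (assumed)
n_blocks = Z.shape[1] // block_size

# Vertical partition: (block_id, columns of that block), cached once so
# every tree reuses the same column blocks instead of re-shipping the data.
blocks = sc.parallelize(
    [(b, Z[:, b * block_size:(b + 1) * block_size]) for b in range(n_blocks)]
).cache()

# Index table: feature index -> (block_id, offset inside the block),
# broadcast so executors building trees could consult it cheaply.
index_table = sc.broadcast(
    {f: (f // block_size, f % block_size) for f in range(Z.shape[1])}
)

def gather_subspace(feature_ids):
    """Fetch only the column blocks that a tree's feature subspace touches."""
    needed = {index_table.value[f][0] for f in feature_ids}
    fetched = dict(blocks.filter(lambda kv: kv[0] in needed).collect())
    cols = [fetched[index_table.value[f][0]][:, index_table.value[f][1]]
            for f in feature_ids]
    return np.stack(cols, axis=1)

# e.g. one tree draws features {1, 5, 10}; only blocks 0, 1 and 2 move
sub = gather_subspace([1, 5, 10])
print(sub.shape)   # (200, 3)
sc.stop()
```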