Improved Parallel Deep Forest Algorithm Combining with Information Theory

doi:10.3778/j.issn.1002-8331.2103-0019

Abstract

Abstract: Parallel deep forest algorithms face three problems when processing big data: too many redundant and irrelevant features, imbalanced multi-grained scanning, and low parallelization efficiency. To address them, this paper proposes IPDFIT (improved parallel deep forest based on information theory), a parallel deep forest algorithm for big data environments. First, a hybrid dimension reduction strategy, DRIT (dimension reduction based on information theory), produces a reduced data set and thereby cuts the number of redundant and irrelevant features. Second, an improved multi-grained scanning strategy, IMGSS, scans the samples so that every feature appears with the same frequency in the scanned data subsets, which avoids the effect of imbalanced multi-grained scanning on the deep forest model. Finally, the random forest models in each level of the cascade are trained in parallel on the MapReduce framework, and a sample weighting strategy, TSWS (the sample weighting strategy), evaluates the samples with the random forest models in the cascade and passes only the poorly evaluated samples on to the next level of training. This gradually reduces the number of training samples at each level and so improves the parallel efficiency of the algorithm. Experimental results show that IPDFIT achieves better classification results in a big data environment, especially on data sets with a large number of features.

Keywords: MapReduce framework, deep forest, DRIT strategy, IMGSS strategy, TSWS strategy
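The scanning imbalance mentioned in the abstract can be illustrated with a small sketch. In a plain stride-1 sliding window, features near the ends of the feature vector fall into fewer windows than features in the middle. One simple way to equalize the counts is to let the window wrap around the end of the vector; this wrap-around scheme is an assumption chosen for illustration and is not necessarily how IMGSS is defined in the paper. The function names `plain_windows`, `cyclic_windows`, and `feature_counts` are hypothetical.

```python
from collections import Counter


def plain_windows(features, width):
    """Stride-1 sliding windows without wrap-around (end features are under-covered)."""
    if not 0 < width <= len(features):
        raise ValueError("width must be between 1 and the number of features")
    return [features[i:i + width] for i in range(len(features) - width + 1)]


def cyclic_windows(features, width):
    """Wrap-around windows (illustrative assumption): one window starts at every position."""
    n = len(features)
    if not 0 < width <= n:
        raise ValueError("width must be between 1 and the number of features")
    return [[features[(i + k) % n] for k in range(width)] for i in range(n)]


def feature_counts(windows):
    """Count how many windows each feature index appears in."""
    return Counter(f for window in windows for f in window)


if __name__ == "__main__":
    feature_ids = list(range(6))
    print("plain :", dict(feature_counts(plain_windows(feature_ids, 3))))
    print("cyclic:", dict(feature_counts(cyclic_windows(feature_ids, 3))))
```

With six features and a window of width three, the plain scan covers the first and last feature once but the middle features three times, whereas the wrap-around scan covers every feature exactly three times.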

MAO Yimin, GENG Junhao, CHEN Liang. Improved Parallel Deep Forest Algorithm Combining with Information Theory[J]. Computer Engineering and Applications, 2022, 58(7): 106-115.
