Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (7): 106-115. DOI: 10.3778/j.issn.1002-8331.2103-0019

• Big Data and Cloud Computing •

Improved Parallel Deep Forest Algorithm Combining with Information Theory

MAO Yimin, GENG Junhao, CHEN Liang

  1. School of Information Engineering, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China
  2. School of Applied Science, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China
  • Online: 2022-04-01 Published: 2022-04-01

Abstract: Parallel deep forest algorithms applied to big data suffer from too many redundant and irrelevant features, imbalanced multi-grained scanning, and low parallelization efficiency. To address these problems, this paper proposes IPDFIT (improved parallel deep forest based on information theory), a parallel deep forest algorithm improved with information theory for big data environments. First, a hybrid dimension reduction strategy, DRIT (dimension reduction based on information theory), is designed to obtain a reduced data set, which effectively decreases the number of redundant and irrelevant features. Second, an improved multi-grained scanning strategy, IMGSS (improved multi-grained scanning strategy), is proposed for scanning the samples; it ensures that every feature appears with the same frequency in the scanned data subsets, avoiding the effect of imbalanced multi-grained scanning on the deep forest model. Finally, the random forest models in each level of the cascade structure are trained in parallel with the MapReduce framework, and a sample weighting strategy, TSWS (the sample weighting strategy), is proposed: samples are evaluated by the random forest models in the cascade, and those with poor evaluation results are selected to enter the next level of training. This gradually reduces the number of training samples at each level and thereby improves the parallel efficiency of the algorithm. Experimental results show that IPDFIT achieves better classification results in big data environments, especially on data sets with a large number of features.

Key words: MapReduce framework, deep forest, DRIT strategy, IMGSS strategy, TSWS strategy
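
The scanning imbalance that IMGSS targets can be seen in a small sketch. With a plain sliding window, features near the two ends of the feature vector fall into fewer windows than features in the middle. The abstract does not state how IMGSS equalizes the frequencies, so the wrap-around window below is only an assumed illustration of one way to reach equal counts, not the authors' method; the function name `window_counts` and its parameters are hypothetical.

```python
def window_counts(n_features, window, wrap=False):
    """Count how many scanning windows (stride 1) contain each feature index.

    wrap=False: ordinary sliding window, as in standard multi-grained scanning.
    wrap=True:  assumed illustration only; windows wrap around the end of the
                feature vector, so every feature is covered equally often.
    """
    counts = [0] * n_features
    # Without wrapping, the last valid start is n_features - window.
    n_starts = n_features if wrap else n_features - window + 1
    for start in range(n_starts):
        for offset in range(window):
            counts[(start + offset) % n_features] += 1
    return counts


plain = window_counts(6, 3)               # [1, 2, 3, 3, 2, 1]
wrapped = window_counts(6, 3, wrap=True)  # [3, 3, 3, 3, 3, 3]
print(plain)
print(wrapped)
```

In the plain case the first and last features each appear in only one of the four windows, while the two middle features appear in three, which is the kind of imbalance the abstract describes.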