Computer Engineering and Applications, 2019, Vol. 55, Issue (23): 125-130. DOI: 10.3778/j.issn.1002-8331.1808-0266


Research of Random Forests Combining with Factor Analysis

LI Huan, XIONG Mengying, NIE Bin, DU Jianqiang, ZHOU Li, HUANG Qiang   

  1. School of Computer, Jiangxi University of Traditional Chinese Medicine, Nanchang 330000, China
  • Online: 2019-12-01  Published: 2019-12-11




Abstract: Because feature importance is imbalanced, a random forest may randomly draw a subset of weak features and grow a “weak decision tree”, which slows the model's convergence and degrades its performance. To address this, this paper proposes a random forest model that incorporates factor analysis. The main innovation is to use factor analysis to partition the features into groups, and then to form the candidate subset at each split node by randomly drawing features from the groups in proportion to the number of features in each group. Using the accuracy and running time of classification prediction, regression fitting, and feature importance analysis as evaluation metrics, nine UCI data sets are selected to assess the model's overall performance, and the model is compared experimentally with decision trees and random forests. The results show that the random forest model with factor analysis largely eliminates low-accuracy decision trees, improves accuracy and convergence speed, generalizes better, and is better suited to high-dimensional big data; the approach is feasible and effective.

Key words: random forest, factor analysis, classification, regression, feature importance, traditional Chinese medicine informatics
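
The candidate-selection step described in the abstract can be illustrated with a minimal sketch. It is only one possible reading, not the paper's implementation. The abstract does not say how features are assigned to groups or how fractional quotas are rounded, so the sketch assumes that each feature joins the factor on which it has the largest absolute loading, and that quotas are rounded by the largest-remainder rule. The function names `group_features_by_factor` and `draw_candidate_subset` are hypothetical, and the loading matrix is made-up illustration data; in practice the loadings would come from a fitted factor analysis model.

```python
import random
from collections import defaultdict


def group_features_by_factor(loadings):
    """Assign each feature to the factor on which it loads most strongly.

    loadings[i][j] is the loading of feature i on factor j.
    Returns {factor_index: [feature_index, ...]}.
    Assumption: "largest absolute loading" is the grouping rule.
    """
    groups = defaultdict(list)
    for feature, row in enumerate(loadings):
        strongest = max(range(len(row)), key=lambda j: abs(row[j]))
        groups[strongest].append(feature)
    return dict(groups)


def draw_candidate_subset(groups, n_candidates, rng):
    """Draw split candidates from the groups in proportion to group size."""
    total = sum(len(members) for members in groups.values())
    if not 0 < n_candidates <= total:
        raise ValueError("n_candidates must be between 1 and the feature count")

    # Largest-remainder rounding (an assumption): take the floor of each
    # exact quota, then give the leftover slots to the groups with the
    # largest fractional parts.
    exact = {g: n_candidates * len(m) / total for g, m in groups.items()}
    quota = {g: int(q) for g, q in exact.items()}
    leftover = n_candidates - sum(quota.values())
    by_remainder = sorted(groups, key=lambda g: exact[g] - quota[g], reverse=True)
    for g in by_remainder[:leftover]:
        quota[g] += 1

    # Sample without replacement inside each group.
    candidates = []
    for g, members in groups.items():
        candidates.extend(rng.sample(members, quota[g]))
    return sorted(candidates)


# Illustration data: 6 features, 2 factors (values are invented).
loadings = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.70, -0.10],
    [0.85, 0.05],
    [0.10, 0.90],
    [-0.20, -0.80],
]
groups = group_features_by_factor(loadings)
subset = draw_candidate_subset(groups, n_candidates=3, rng=random.Random(0))
print(groups)
print(subset)
```

With four features in the first group and two in the second, a request for three candidates draws two from the first group and one from the second. In a full model this draw would be repeated at every split node, so the candidate subset of a node is spread across the factor groups instead of being taken from the whole feature set at once.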
