Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (20): 77-84. DOI: 10.3778/j.issn.1002-8331.2207-0443

• Theory, Research and Development •

Random Forest Optimization Algorithm Fused with Approximate Markov Blanket

LUO Jigen, XIONG Lingzhu, DU Jianqiang, NIE Bin, XIONG Wangping, LI Zhiqin   

  1. College of Computer Science, Jiangxi University of Chinese Medicine, Nanchang 330004, China
  2. Information Office, Jiangxi Normal University, Nanchang 330022, China
  • Online: 2023-10-15  Published: 2023-10-15

Abstract: Correlation and redundancy among features directly affect the quality of the features that a random forest samples at random, weakening the forest's convergence and reducing the accuracy, generalization ability, and overall performance of the model. To address this, this paper proposes a random forest optimization algorithm that incorporates approximate Markov blankets. The algorithm uses approximate Markov blankets to build groups of similar features, then samples features proportionally from each group to form the feature subset of a single decision tree, and repeats this process until the desired forest size is reached. By using approximate Markov blankets to eliminate correlation and redundancy among features while preserving the diversity of the forest's features, the algorithm improves the quality of the randomly sampled features. Experimental comparisons on 12 UCI datasets of different dimensionalities show that the random forest incorporating approximate Markov blankets can, to a certain extent, eliminate feature correlation and redundancy, improve the model's evaluation metrics, and enhance its generalization ability, making it better suited to high-dimensional data.
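
The abstract describes the grouping and sampling procedure only at a high level. The sketch below illustrates one plausible reading of that pipeline in Python. It assumes symmetric uncertainty (SU) as the correlation measure used to form approximate Markov blankets (in the style of FCBF-type feature selection), discrete or discretized inputs, and an illustrative per-group sampling ratio; the measure, the ratio, and all function names are assumptions made for illustration, not details given by the paper.

```python
# Minimal sketch of the pipeline described in the abstract (not the authors' code).
# Assumptions: symmetric uncertainty (SU) as the feature-feature and feature-class
# correlation measure (FCBF-style approximate Markov blanket), discrete/discretized
# inputs, and an illustrative per-group sampling ratio.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier


def entropy(values):
    """Empirical Shannon entropy of a discrete sequence."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def symmetric_uncertainty(x, y):
    """SU(x, y) = 2 * I(x; y) / (H(x) + H(y)), normalized to [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    mi = hx + hy - entropy(list(zip(x, y)))      # mutual information
    return 2.0 * mi / (hx + hy) if hx + hy > 0 else 0.0


def group_by_approx_markov_blanket(X, y):
    """Group similar features: feature i approximately subsumes feature j when
    SU(i, class) >= SU(j, class) and SU(i, j) >= SU(j, class)."""
    n_features = X.shape[1]
    su_class = [symmetric_uncertainty(X[:, j], y) for j in range(n_features)]
    order = np.argsort(su_class)[::-1]           # strongest features first
    groups, assigned = [], set()
    for i in order:
        if i in assigned:
            continue
        group = [i]
        assigned.add(i)
        for j in order:
            if j not in assigned and symmetric_uncertainty(X[:, i], X[:, j]) >= su_class[j]:
                group.append(j)
                assigned.add(j)
        groups.append(group)
    return groups


def build_forest(X, y, groups, n_trees=50, ratio=0.5, seed=None):
    """For each tree, sample `ratio` of the features from every similarity group,
    bootstrap the rows, and fit a decision tree; repeat until the forest size is reached."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        feats = []
        for g in groups:
            k = max(1, int(np.ceil(ratio * len(g))))
            feats.extend(rng.choice(g, size=k, replace=False))
        feats = np.asarray(feats)
        rows = rng.integers(0, len(X), size=len(X))          # bootstrap sample
        tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, feats], y[rows])
        forest.append((feats, tree))
    return forest


def predict(forest, X):
    """Majority vote over the trees' predictions."""
    votes = np.array([tree.predict(X[:, feats]) for feats, tree in forest])
    return np.apply_along_axis(lambda col: Counter(col).most_common(1)[0][0], 0, votes)


if __name__ == "__main__":
    # Toy demo with discrete features (the SU estimate above assumes discrete inputs).
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(200, 12))
    y = (X[:, 0] + X[:, 1] > 2).astype(int)
    forest = build_forest(X, y, group_by_approx_markov_blanket(X, y), n_trees=20, seed=0)
    print("training accuracy:", (predict(forest, X) == y).mean())
```

Under these assumptions, every similarity group contributes at least one feature to every tree, so feature diversity is preserved while strongly correlated or redundant features are unlikely to be sampled together, which is the property the abstract emphasizes.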

Key words: random forest, approximate Markov blanket, feature selection, high-dimensional samples