Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (6): 168-175. DOI: 10.3778/j.issn.1002-8331.1911-0185

• Graphics and Image Processing •

Research on Interpretable Visual Analysis Method of Random Forest

YANG Yemin, ZHANG Huijun, ZHANG Xiaolong   

  1. College of Information and Computer, Taiyuan University of Technology, Jinzhong, Shanxi 030600, China
    2. College of Media Technology, Communication University of Shanxi, Jinzhong, Shanxi 030619, China
  • Online: 2021-03-15  Published: 2021-03-12

Abstract:

Random forests are typically applied in a black-box manner: the details of parameter tuning, training, and even the final constructed model are hidden from users. This results in poor model interpretability, which to some extent hinders the model from being used in fields that require transparent and explainable predictions, such as medical diagnosis, justice, and security. The interpretation challenges stem from the randomness of feature selection and data sampling. Furthermore, a random forest contains many decision trees, so it is difficult or even impossible for users to understand and compare the structures and properties of all of them. To tackle these issues, an interactive visual analytics system, FORESTVis, is designed and implemented. It includes a tree view, partial dependence plots, a t-SNE projection view, a feature view, and other interactive visual components. With this system, researchers and practitioners can intuitively understand the basic structure and working mechanism of random forests, and users are assisted in evaluating model performance through interactive exploration. Finally, a case study on a public Kaggle dataset shows that the method is feasible and effective.

Key words: random forests, visual analysis, interaction design, interpretable machine learning