Computer Engineering and Applications ›› 2015, Vol. 51 ›› Issue (13): 255-258.

• Engineering and Applications •

Optimization and application of the C4.5 decision tree algorithm

MIAO Yufei1, ZHANG Xiaohong1,2

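  1. College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan 454000, China
  2. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, China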
  • Publication date: 2015-07-01 Release date: 2015-06-30

Improvement and application of C4.5 decision tree algorithm

MIAO Yufei1, ZHANG Xiaohong1,2   

  1. College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan 454000, China
  2. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, China
  • Online: 2015-07-01 Published: 2015-06-30

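Abstract: Although C4.5 is currently the most influential decision tree classification algorithm, it still has shortcomings. To address the time C4.5 spends discretizing continuous-valued attributes, the algorithm is simplified on the basis of the boundary theorem of Fayyad and Irani by using the Gini index in place of information entropy once the continuous attributes have been discretized. To address overfitting in decision tree algorithms, the algorithm is improved by adopting the resubstitution estimate, following Occam’s razor. These ideas are applied to financial loan data. The experimental results show that, with accuracy maintained, the improved C4.5 algorithm reduces execution time by 8.74% on average and model complexity by 6.26% on average, which demonstrates its effectiveness.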

Keywords: C4.5 algorithm, boundary theorem, Gini index, Occam’s razor, resubstitution estimate

Abstract: C4.5 is currently the most influential decision tree classification algorithm, but it still has some deficiencies. To reduce the time C4.5 spends discretizing continuous-valued attributes, a simplified algorithm is proposed that, based on the boundary theorem of Fayyad and Irani, uses the Gini index in place of information entropy after the continuous-valued attributes are discretized. To address the overfitting problem of decision tree methods, the algorithm is further improved by using the resubstitution estimate, following Occam’s razor. The approach is applied to financial loan data. The experimental results show that, with accuracy maintained, the improved C4.5 algorithm reduces execution time by an average of 8.74% and model complexity by an average of 6.26%, which verifies the validity of the algorithm.
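The abstract does not give the algorithm itself, so the following is only a minimal sketch of the idea it describes: restrict candidate thresholds for a continuous attribute to boundary points (midpoints between adjacent distinct values whose examples do not all share one class) and score each candidate with the Gini index instead of entropy. The function names `gini`, `boundary_cut_points`, and `best_gini_split` are hypothetical, and applying the boundary restriction to Gini scoring in this way is an assumption, not the authors' published procedure.

```python
from collections import Counter
from itertools import groupby


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def boundary_cut_points(values, labels):
    """Midpoints between adjacent distinct values, skipping non-boundary points."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    # One entry per distinct value: (value, set of classes seen at that value).
    groups = [(v, {lab for _, lab in grp})
              for v, grp in groupby(pairs, key=lambda p: p[0])]
    cuts = []
    for (v1, c1), (v2, c2) in zip(groups, groups[1:]):
        # Skip only when both neighbouring values carry the same single class.
        if not (len(c1) == 1 and c1 == c2):
            cuts.append((v1 + v2) / 2)
    return cuts


def best_gini_split(values, labels):
    """Return (threshold, weighted Gini) minimising impurity over boundary cuts."""
    n = len(values)
    best = (None, float("inf"))
    for t in boundary_cut_points(values, labels):
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best
```

For six sorted values labelled `aaabbb`, this sketch evaluates a single candidate threshold rather than all five midpoints, which is where the time saving would come from.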

Key words: C4.5 algorithm, boundary theorem, Gini index, Occam’s razor, resubstitution estimate
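The resubstitution estimate named above is the error rate of a tree measured on the same training data used to build it. The abstract does not state the exact pruning criterion, so the sketch below substitutes a CART-style penalised resubstitution estimate with a fixed per-leaf penalty: a subtree is collapsed to a leaf when the leaf's penalised estimate is no worse than the subtree's. The `Node` class, the `prune` function, and the `alpha` penalty are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; `labels` holds the training labels reaching it."""
    labels: list
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children


def leaf_errors(labels):
    """Training examples misclassified if the node predicts its majority class."""
    return len(labels) - max(Counter(labels).values()) if labels else 0


def subtree_errors(node):
    """Resubstitution error count of a subtree: the sum over its leaves."""
    if node.is_leaf():
        return leaf_errors(node.labels)
    return sum(subtree_errors(c) for c in node.children)


def count_leaves(node):
    """Number of leaves, used here as the measure of model complexity."""
    return 1 if node.is_leaf() else sum(count_leaves(c) for c in node.children)


def prune(node, n_total, alpha=0.01):
    """Bottom-up pruning with an assumed per-leaf penalty `alpha`."""
    for child in node.children:
        prune(child, n_total, alpha)
    if node.is_leaf():
        return node
    as_leaf = leaf_errors(node.labels) / n_total + alpha
    as_tree = subtree_errors(node) / n_total + alpha * count_leaves(node)
    if as_leaf <= as_tree:
        node.children = []  # prefer the simpler model (Occam's razor)
    return node
```

A split that does not lower the training error is removed under this rule, while a split that separates the classes cleanly is kept; a larger `alpha` prunes more aggressively.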