Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (22): 160-165. DOI: 10.3778/j.issn.1002-8331.1908-0416

• Pattern Recognition and Artificial Intelligence •

Decision Tree Construction Method Based on Reduction Attribute and Threshold Segmentation

TAN Zhenghua, DAI Liping, WEN Yang, LI Guotai

  1. College of Information Engineering, Xiangtan University, Xiangtan, Hunan 411105, China
  • Publication date: 2020-11-15   Release date: 2020-11-13

Abstract:

To address the high time complexity of the C4.5 decision tree algorithm when it handles continuous-valued attributes, a new decision tree construction method is proposed. The method uses the Pearson correlation coefficient between attributes, taken from probability theory, to reduce the attributes of the data set. Combined with each attribute's information gain ratio, it retains the optimal subset of decision attributes and ensures that this subset contains no redundant attributes. It also uses boundary-point determination to improve the threshold segmentation step in the discretization of continuous-valued attributes, and it corrects the calculation of the information gain ratio. A series of comparative experiments was run on the PyCharm platform using data sets from the UCI database. The results show that the improved C4.5 algorithm raises decision tree generation efficiency by about 50% and accuracy by about 2%, and that it fairly effectively resolves the original C4.5 algorithm's bias toward continuous-valued attributes during attribute selection.

Key words: decision tree, redundant attribute, boundary point, threshold segmentation, information gain ratio
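
The two steps named in the abstract, correlation-based attribute reduction and boundary-point threshold selection, can be illustrated with a small Python sketch. The abstract does not give the exact reduction rule, the correlation cutoff, or the corrected gain-ratio formula, so everything below is an assumption made for illustration: the function names, the greedy keep-if-uncorrelated rule, the `r_max=0.9` cutoff, and the use of plain information gain to rank candidate thresholds. Boundary points are taken in the usual sense, as midpoints between adjacent distinct attribute values across which the class label can change.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def reduce_attributes(columns, scores, r_max=0.9):
    """Hypothetical reduction rule (not taken from the paper).

    columns: dict mapping attribute name -> list of numeric values.
    scores:  dict mapping attribute name -> relevance, e.g. its gain ratio.
    Attributes are visited from highest to lowest score; one is kept only
    if |r| with every already-kept attribute is below r_max.
    """
    kept = []
    for name in sorted(columns, key=scores.get, reverse=True):
        if all(abs(pearson(columns[name], columns[k])) < r_max for k in kept):
            kept.append(name)
    return kept


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def boundary_thresholds(values, labels):
    """Midpoints between adjacent distinct values where the class can change."""
    classes_at = {}
    for v, y in zip(values, labels):
        classes_at.setdefault(v, set()).add(y)
    keys = sorted(classes_at)
    cuts = []
    for a, b in zip(keys, keys[1:]):
        # Skip the cut only when both neighbours carry the same single class.
        same_pure_class = (
            len(classes_at[a]) == 1 and classes_at[a] == classes_at[b]
        )
        if not same_pure_class:
            cuts.append((a + b) / 2)
    return cuts


def info_gain(values, labels, t):
    """Information gain of the binary split value <= t versus value > t."""
    left = [y for v, y in zip(values, labels) if v <= t]
    right = [y for v, y in zip(values, labels) if v > t]
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


def best_threshold(values, labels):
    """Best cut among boundary points only, or None if there is no boundary."""
    cuts = boundary_thresholds(values, labels)
    if not cuts:
        return None
    return max(cuts, key=lambda t: info_gain(values, labels, t))
```

In a sketch like this, a run of identically labelled values contributes no candidate thresholds, which is the kind of saving on continuous attributes that the abstract points to. Whether the paper's method matches this sketch in detail is not stated in the abstract.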