Computer Engineering and Applications ›› 2008, Vol. 44 ›› Issue (1): 159-161.

• Database and Information Processing •

Discretization of numeric attribute based on example distribution and entropy

LIN Yong-min1, LU Zhen-yu1, ZHAO Shuang1, ZHU Wei-dong2

  1. College of Economics and Management, Hebei Polytechnic University, Tangshan, Hebei 063009, China
  2. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  • Online: 2008-01-01 Published: 2008-01-01
  • Contact: LIN Yong-min

Abstract: Discretization of continuous attributes is an important task in data preprocessing. This paper analyzes the shortcomings of entropy-based discretization methods and, starting from an estimate of the probability distribution of the training examples, proposes a method for handling numeric attributes that combines example distribution with entropy. Experimental results on UCI data sets show that the proposed method not only achieves fairly good classification accuracy but also computes faster.

Key words: numeric attribute, entropy, distribution of training examples, discretization
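
For readers unfamiliar with the entropy-based discretization that the abstract takes as its starting point, the following is a minimal sketch of how a single entropy-minimizing cut point can be chosen for one numeric attribute. It illustrates only the generic baseline idea, not the distribution-based method proposed in the paper, whose details the abstract does not give. The function names `entropy` and `best_cut` are hypothetical and introduced here purely for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (cut_point, weighted_entropy) for the binary split of one
    numeric attribute that minimizes class entropy, or None if all values
    are equal. Illustrative baseline only (hypothetical helper)."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary can be placed between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Entropy of the two partitions, weighted by their sizes.
        score = (i / n) * entropy(left) + ((n - i) / n) * entropy(right)
        if best is None or score < best[1]:
            best = ((lo + hi) / 2, score)  # midpoint as the cut point
    return best


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    labels = ["a", "a", "a", "b", "b", "b"]
    print(best_cut(values, labels))
```

This sketch re-evaluates the class entropy at every candidate boundary between adjacent sorted values, and a full discretizer would typically apply the split recursively with a stopping criterion. That repeated evaluation is one plausible source of the computational cost that the abstract says the proposed method reduces.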