Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (12): 139-143.


Information entropy based subspace clustering algorithm

LIU Jingjie1, TAO Liang2   

  1. Department of Computer Technology, Anhui Vocational and Technical College of Industry and Trade, Huainan, Anhui 232007, China
  2. School of Computer Science and Technology, Anhui University, Hefei 230601, China
  • Online: 2012-04-21  Published: 2012-04-20

一种基于信息熵的子空间聚类算法

刘竞杰1,陶  亮2   

  1. 安徽工贸职业技术学院 计算机技术系,安徽 淮南 232007
  2. 安徽大学 计算机科学与技术学院,合肥 230601

Abstract: A method is proposed for estimating the probability density of data on a data stream; it combines the classical Parzen window method with a more reasonable strategy for fading old data. On this basis, the information entropy of the data set's projections onto low-dimensional subspaces can be calculated. Exploiting the close relationship between entropy and data distribution, an entropy-based subspace clustering algorithm for high-dimensional data streams, called PStream, is developed. Theoretical analysis and simulation results show that PStream clusters a data stream with high precision in a single pass. Its running efficiency differs little from that of existing methods such as HPStream, but its clustering quality is clearly better.

Key words: data streams, clustering, high dimension, subspace, data mining
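
The idea summarized in the abstract (a Parzen window density estimate over faded stream data, followed by an entropy calculation on low-dimensional projections) can be illustrated with a minimal sketch. This is not the PStream algorithm itself, whose details the abstract does not give. The exponential fading function, the Gaussian kernel, the fixed bandwidth, the resubstitution entropy estimate, and every function and parameter name below are assumptions chosen for illustration only.

```python
import math


def faded_weights(timestamps, now, decay=0.01):
    """Assumed fading rule: a point of age a gets weight 2**(-decay * a)."""
    return [2.0 ** (-decay * (now - t)) for t in timestamps]


def parzen_density(x, samples, weights, bandwidth):
    """Weighted Parzen window (Gaussian kernel) density estimate at x, in 1-D."""
    norm = bandwidth * math.sqrt(2.0 * math.pi) * sum(weights)
    total = sum(
        w * math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
        for s, w in zip(samples, weights)
    )
    return total / norm


def projection_entropy(samples, weights, bandwidth):
    """Resubstitution estimate of differential entropy, -E[log p(x)],
    with the expectation taken over the faded sample weights."""
    weight_sum = sum(weights)
    return -sum(
        w * math.log(parzen_density(s, samples, weights, bandwidth))
        for s, w in zip(samples, weights)
    ) / weight_sum


def rank_dimensions(points, timestamps, now, bandwidth=0.1, decay=0.01):
    """Return (dimension indices sorted by ascending entropy, entropies).

    A low-entropy one-dimensional projection is concentrated, which hints
    that the dimension is useful for separating clusters.
    """
    weights = faded_weights(timestamps, now, decay)
    dims = list(range(len(points[0])))
    entropies = [
        projection_entropy([p[d] for p in points], weights, bandwidth)
        for d in dims
    ]
    return sorted(dims, key=lambda d: entropies[d]), entropies


if __name__ == "__main__":
    # Dimension 0 is tightly concentrated; dimension 1 is spread out.
    pts = [(0.01 * (i % 3), 0.5 * i) for i in range(20)]
    order, ents = rank_dimensions(pts, list(range(20)), now=20)
    print(order, ents)
```

In this sketch, older points contribute less to both the density and the entropy estimate, so a dimension's ranking can change as the stream evolves. A real stream algorithm would also have to maintain these quantities incrementally rather than recomputing them over stored points.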

摘要: 结合传统的Parzen窗方法并引入一种更加合理的历史数据丢弃策略,在此基础上,通过计算可以得到整个数据集在低维空间投影的信息熵,利用信息熵实现了一种适用于高维数据流的子空间聚类算法(PStream)。理论及实验均表明,与传统的算法相比,该算法可以在一次遍历的前提下,完成对数据流的高精度聚类,虽然其运行效率与现有的方法(如HPStream)相比差别不大,但是却明显地改善了聚类效果。

关键词: 数据流, 聚类, 高维, 子空间, 数据挖掘