Computer Engineering and Applications (计算机工程与应用) ›› 2011, Vol. 47 ›› Issue (8): 118-121.

• Databases, Signal and Information Processing •

结合自助抽样的动态数据流贝叶斯分类算法

琚春华,殷贤君,许翀寰   

  1. 浙江工商大学 计算机与信息工程学院,杭州 310018
  • 收稿日期:1900-01-01 修回日期:1900-01-01 出版日期:2011-03-11 发布日期:2011-03-11

Bayesian classification algorithm of dynamic data stream based on bootstrap

JU Chunhua, YIN Xianjun, XU Chonghuan

  1. College of Computer Science & Information Engineering, Zhejiang Gongshang University, Hangzhou 310018, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2011-03-11 Published: 2011-03-11

摘要: 动态数据流具有数据量大、变化快、随机存取代价高、详细数据难以存储等特点,挖掘动态数据流对计算能力与存储能力要求非常高。针对动态数据流的以上特点,设计了一种基于自助抽样的动态数据流贝叶斯分类算法,算法运用滑动窗口模型对动态数据流进行处理分析。该模型以每个窗口的数据为基本单位,对窗口内的数据进行处理分析;算法采用自助抽样技术对待分类数据中的属性进行裁剪和优化,解决了数据属性间的多重线性相关问题;算法结合贝叶斯算法的特点,采用动态增量存储树来解决动态样本数据流的存储问题,实现了无限动态数据流无信息失真的静态有限存储,解决了动态数据流挖掘最大的难题——数据存储;对优化的待分类数据使用all-贝叶斯分类器和k-贝叶斯分类器进行分类,结合数据流的特性对两个分类器进行实时更新。该算法有效克服了贝叶斯分类属性独立性的约束和传统贝叶斯只对静态数据分类的缺点,克服了动态数据流最大的难题——数据存储问题。通过实验测试证明,基于自助抽样的贝叶斯分类具有很高的时效性和精确性。

关键词: 数据流, 自助抽样, 贝叶斯分类, 滑动窗口, 增量存储树

Abstract: Dynamic data streams are large in volume, change rapidly, are costly to access at random, and are difficult to store in full detail, so mining them places heavy demands on computing power and storage capacity. To address these characteristics, a Bayesian classification algorithm for dynamic data streams based on bootstrap sampling is proposed. The algorithm processes the stream with a sliding window model, which takes the data in each window as the basic unit of analysis. Bootstrap sampling is used to prune and optimize the attributes of the data to be classified, which addresses multicollinearity among the attributes. Drawing on the characteristics of the Bayesian algorithm, a dynamic incremental storage tree stores the sample data stream, giving static, finite storage of an unbounded dynamic stream without loss of information; this addresses data storage, the biggest problem in dynamic data stream mining. The optimized data are then classified with an all-Bayesian classifier and a k-Bayesian classifier, both of which are updated in real time according to the characteristics of the stream. The algorithm thereby relaxes the attribute-independence constraint of Bayesian classification and removes the restriction of traditional Bayesian classifiers to static data. Experiments show that the bootstrap-based Bayesian classification algorithm achieves high timeliness and accuracy.

Key words: data stream, bootstrap, Bayesian classification, sliding window, incremental storage tree
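
The abstract names a sliding window model and incrementally updated Bayesian classifiers but gives no implementation detail. The following is a minimal Python sketch of that general idea only, not the authors' algorithm: it keeps per-window count tables for a plain naive Bayes classifier over categorical attributes and updates them as windows arrive. The class name `WindowedNaiveBayes`, the `max_windows` parameter, the eviction of old windows, and the Laplace smoothing are all assumptions made for illustration. The paper's bootstrap attribute pruning, incremental storage tree, and all-/k-Bayesian classifiers are not reproduced here.

```python
from collections import Counter, defaultdict, deque
import math


class WindowedNaiveBayes:
    """Hypothetical sketch: naive Bayes counts kept for the most recent windows."""

    def __init__(self, max_windows=3):
        self.max_windows = max_windows    # assumed limit; not taken from the paper
        self.windows = deque()            # one (class_counts, feature_counts) pair per window
        self.class_counts = Counter()     # label -> number of samples
        self.feature_counts = Counter()   # (label, attr_index, value) -> number of samples
        self.values = defaultdict(set)    # attr_index -> distinct values seen so far

    def add_window(self, rows, labels):
        """Fold one window of labelled samples into the running counts."""
        cc, fc = Counter(), Counter()
        for row, label in zip(rows, labels):
            cc[label] += 1
            for i, v in enumerate(row):
                fc[(label, i, v)] += 1
                self.values[i].add(v)
        self.windows.append((cc, fc))
        self.class_counts.update(cc)
        self.feature_counts.update(fc)
        # Assumption of this sketch: the oldest window is forgotten once the limit is hit.
        if len(self.windows) > self.max_windows:
            old_cc, old_fc = self.windows.popleft()
            self.class_counts.subtract(old_cc)
            self.feature_counts.subtract(old_fc)

    def predict(self, row):
        """Return the label with the highest log-posterior, or None if no data is held."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            if n <= 0:
                continue  # label only occurred in windows that were evicted
            score = math.log(n / total)
            for i, v in enumerate(row):
                k = max(len(self.values[i]), 1)
                # Laplace smoothing so unseen values do not zero out the product
                score += math.log((self.feature_counts[(label, i, v)] + 1) / (n + k))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Because only counts are stored, memory grows with the number of distinct (label, attribute, value) combinations rather than with the length of the stream. Note that this sketch discards old windows, whereas the abstract describes storage without loss of information.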