Computer Engineering and Applications, 2023, Vol. 59, Issue (20): 254-265. DOI: 10.3778/j.issn.1002-8331.2210-0230

• Big Data and Cloud Computing •

Data Stream Classification Method Combining Micro-Clustering and Active Learning

YIN Chunyong, CHEN Shuangshuang   

  1. School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China
  • Online: 2023-10-15 Published: 2023-10-15

结合微聚类和主动学习的流分类方法

尹春勇,陈双双   

  1. 南京信息工程大学 计算机学院、网络空间安全学院,南京 210044

Abstract: Data stream classification is an important research topic in data mining, but concept drift and the high cost of labeling in data streams make classification challenging. Most existing work adopts online classification techniques based on active learning, which alleviate concept drift and label scarcity to some extent. However, these methods classify inefficiently and ignore memory overhead. To address these problems, a data stream classification method combining micro-clustering and active learning (CALC) is proposed. First, a new hybrid query strategy for active learning is proposed and combined with error-based representation learning to measure the importance of each micro-cluster during maintenance. Second, a set of micro-clusters is maintained dynamically to adapt to the concept drift that arises in the data stream. In addition, a micro-cluster-based lazy learning approach is used to classify the data stream and to update the cached micro-clusters online. Finally, comparative experiments on three real datasets and three synthetic datasets show that CALC outperforms existing data stream classification algorithms in both classification accuracy and memory overhead. Compared with the benchmark model ORSL (online reliable semi-supervised learning on evolving data streams), CALC raises the average accuracy on the six datasets by 5.07, 2.41, 1.04, 1.03, 3.47 and 0.64 percentage points, respectively.

Key words: active learning, data stream classification, micro-clustering, concept drift

摘要: 数据流分类是数据挖掘中重要的研究内容,但是数据流中的概念漂移和标记成本昂贵的问题给分类带来了巨大的挑战。现有的研究工作大多采用基于主动学习的在线分类技术,一定程度上缓解了概念漂移和有限标签的问题,但是这些方法的分类效率较低,并且忽略了内存开销的问题。针对这些问题提出了一种结合微聚类和主动学习的流分类方法(a data stream classification method combining micro-clustering and active learning,CALC)。提出一种新的主动学习混合查询策略,将其与基于错误的表示学习相结合,从而在维护过程中衡量每个微聚类的重要性,通过动态维护一组微聚类以适应数据流中产生的概念漂移。采用基于微聚类的惰性学习方法,实现对数据流的分类,并完成对缓存微聚类的在线更新。使用三个真实数据集和三个人工合成数据集进行实验,结果显示CALC在分类准确率和内存开销方面优于现有的数据流分类算法。与基准模型(online reliable semi-supervised learning on evolving data streams,ORSL)相比,CALC的分类准确率有一定的提升,在六个数据集上的平均准确率分别提高了5.07、2.41、1.04、1.03、3.47、0.64个百分点。

关键词: 主动学习, 数据流分类, 微聚类, 概念漂移
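
The abstract describes three cooperating pieces: a bounded set of micro-clusters, a query strategy that decides when a label is worth requesting, and lazy (nearest micro-cluster) classification with online updates. The following is a minimal illustrative sketch of that general idea, not the CALC algorithm itself. The class names, the distance-and-margin query rule, the importance score, and all parameter values (`radius`, `max_clusters`, `margin`, `decay`) are assumptions made for this example and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    center: list
    label: int
    count: int = 1
    importance: float = 1.0  # hypothetical score; the paper's measure may differ

    def absorb(self, x):
        # Incremental mean update of the cluster center.
        self.count += 1
        self.center = [c + (xi - c) / self.count for c, xi in zip(self.center, x)]


class MicroClusterClassifier:
    """Illustrative sketch only; all parameters are assumed values."""

    def __init__(self, radius=1.0, max_clusters=20, margin=0.2, decay=0.99):
        self.radius = radius              # assumed absorption radius
        self.max_clusters = max_clusters  # memory bound on cached clusters
        self.margin = margin              # assumed uncertainty margin
        self.decay = decay                # forgetting factor for old clusters
        self.clusters = []

    def _ranked(self, x):
        # (distance, cluster) pairs sorted by distance only.
        pairs = ((math.dist(mc.center, x), mc) for mc in self.clusters)
        return sorted(pairs, key=lambda p: p[0])

    def predict(self, x):
        # Lazy learning: no model is trained, the nearest cluster votes.
        ranked = self._ranked(x)
        return ranked[0][1].label if ranked else None

    def should_query(self, x):
        # Ask for a label if x lies in an unseen region, or if a cluster
        # of a different class is almost as close as the nearest one.
        ranked = self._ranked(x)
        if not ranked or ranked[0][0] > self.radius:
            return True
        d1, nearest = ranked[0]
        rivals = [d for d, mc in ranked[1:] if mc.label != nearest.label]
        return bool(rivals) and (rivals[0] - d1) < self.margin

    def learn(self, x, y):
        # Decay all scores so stale clusters fade under concept drift.
        for mc in self.clusters:
            mc.importance *= self.decay
        ranked = self._ranked(x)
        if ranked and ranked[0][0] <= self.radius:
            nearest = ranked[0][1]
            if nearest.label == y:
                nearest.absorb(x)
                nearest.importance += 1.0
                return
            nearest.importance -= 1.0  # penalize a cluster that misclassified x
        self.clusters.append(MicroCluster(list(x), y))
        if len(self.clusters) > self.max_clusters:
            weakest = min(range(len(self.clusters)),
                          key=lambda i: self.clusters[i].importance)
            del self.clusters[weakest]


model = MicroClusterClassifier(radius=1.0, max_clusters=4)
stream = [((0.0, 0.0), 0), ((5.0, 5.0), 1), ((0.2, 0.1), 0), ((5.1, 4.9), 1)]
queried = 0
for x, y in stream:
    if model.should_query(x):  # a label is requested only when needed
        model.learn(x, y)
        queried += 1

print(queried, len(model.clusters))
print(model.predict((0.3, 0.3)), model.predict((4.8, 5.2)))
```

In this toy stream only the first instance of each class triggers a label request; the later instances fall well inside an existing micro-cluster and are classified without labeling. Capping the cluster list at `max_clusters` is what keeps memory use bounded in the sketch.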