Computer Engineering and Applications ›› 2015, Vol. 51 ›› Issue (6): 124-128.

• Database, Data Mining, Machine Learning •

Data stream classifier with limited labelled data

XIONG Zhongyang (熊忠阳), ZHOU Xingqin (周兴勤), ZHANG Yufang (张玉芳)

  1. School of Computer Science, Chongqing University, Chongqing 400030, China
  • Online: 2015-03-15 Published: 2015-03-13

Abstract: Most data stream classification algorithms address the problems of infinite stream length and concept drift. However, these algorithms require human experts to label every instance before the instances can be used as a training set. This is impractical when a data stream arrives at high speed and must be classified quickly, because labelling instances is both time-consuming and costly. If a classifier is instead trained by supervised learning on the few labelled instances available, the result is a weak classifier. This paper proposes a data stream classification algorithm based on active learning. The method selects only a small fraction of the instances for manual labelling, namely those classified with low confidence, which greatly reduces the number of instances that must be labelled. Experimental results show that the proposed method can train a classifier on a concept-drifting data stream using a small amount of labelled data and still achieve good classification performance.

Key words: data streams, classification, concept drifting, active learning
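
The selection rule described in the abstract (request a label only when the classifier's confidence is low) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the base classifier, the confidence measure, or the threshold, so the nearest-centroid model `CentroidClassifier`, the distance-margin confidence, the value `threshold=0.3`, and the synthetic stream from `make_stream` are all assumptions made for illustration. Concept drift handling is omitted entirely.

```python
import math
import random


class CentroidClassifier:
    """Incremental nearest-centroid classifier (hypothetical stand-in model)."""

    def __init__(self):
        self.sums = {}    # label -> per-feature running sums
        self.counts = {}  # label -> number of labelled instances seen

    def update(self, x, label):
        """Add one labelled instance to the running centroid of its class."""
        sums = self.sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            sums[i] += value
        self.counts[label] = self.counts.get(label, 0) + 1

    def predict(self, x):
        """Return (label, confidence), with confidence in [0, 1]."""
        if len(self.counts) < 2:
            # Fewer than two classes seen: zero confidence forces a label request.
            return next(iter(self.counts), None), 0.0
        dists = []
        for label, sums in self.sums.items():
            n = self.counts[label]
            centroid = [s / n for s in sums]
            dists.append((math.dist(x, centroid), label))
        dists.sort()
        (d1, best), (d2, _) = dists[0], dists[1]
        # Assumed confidence measure: relative margin between the two nearest centroids.
        confidence = (d2 - d1) / (d2 + d1) if d2 + d1 > 0 else 0.0
        return best, confidence


def active_stream_learning(stream, threshold=0.3):
    """Process (features, true_label) pairs one at a time.

    The true label is used for training only when the prediction confidence
    is below `threshold`, which simulates asking a human expert. It is also
    used to score accuracy, but that is for evaluation only.
    """
    model = CentroidClassifier()
    n_total = n_queried = n_correct = 0
    for x, true_label in stream:
        predicted, confidence = model.predict(x)
        n_total += 1
        n_correct += predicted == true_label
        if confidence < threshold:
            model.update(x, true_label)  # low confidence: request the label
            n_queried += 1
    accuracy = n_correct / n_total if n_total else 0.0
    return n_queried, n_total, accuracy


def make_stream(n=2000, seed=0):
    """Synthetic two-class stream: Gaussian blobs around (0, 0) and (5, 5)."""
    rng = random.Random(seed)
    centers = {0: (0.0, 0.0), 1: (5.0, 5.0)}
    for _ in range(n):
        label = rng.randrange(2)
        cx, cy = centers[label]
        yield (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label


if __name__ == "__main__":
    queried, total, accuracy = active_stream_learning(make_stream())
    print(f"labelled {queried} of {total} instances, accuracy {accuracy:.3f}")
```

Because only low-confidence instances are labelled, the labelled set is concentrated near the decision boundary. A method intended for drifting streams would also need some way to forget outdated instances, which this sketch does not attempt.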