Computer Engineering and Applications ›› 2018, Vol. 54 ›› Issue (11): 116-121. DOI: 10.3778/j.issn.1002-8331.1701-0118

• Pattern Recognition and Artificial Intelligence •

Time series classification based on PU problem with semi-supervised learning and active learning

CHEN Juan1, ZHU Fuxi1,2

  1. Computer School, Wuhan University, Wuhan 430072, China
  2. School of Computer Science and Technology, Hankou University, Wuhan 430212, China
  • Online: 2018-06-01 Published: 2018-06-14

Abstract: Time series classification in the positive and unlabeled (PU) setting commonly relies on semi-supervised learning to label the samples in the unlabeled set U automatically and then build a classifier. Automatic labels for samples near the class boundary are often wrong, however, which degrades the resulting classifier. To address this problem, the paper proposes OAL (Only Active Learning), a method that uses active learning to select samples from U for manual labeling by an expert and builds the classifier from them. OAL follows query by committee (QBC): several classifiers are trained on the labeled data and vote on each unlabeled sample, and the disagreement among their votes measures how inconsistent the sample's predicted class is. This disagreement is combined with the sample's distribution density to give the sample's information content, which serves as the data selection strategy for active learning. Because the amount of data that experts can label by hand is limited, the paper further combines active learning with semi-supervised learning on top of OAL: during the active learning iterations, samples whose predicted class is highly consistent are labeled automatically, which increases the amount of labeled data and keeps the training set large enough to build the classifier. Experiments show that, with partial manual labeling, the proposed method builds more accurate classifiers for PU data sets than semi-supervised learning does.

Key words: time series, positive and unlabeled (PU) problem, classification, active learning, semi-supervised learning
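The selection strategy described in the abstract can be sketched in code. The abstract gives no formulas, so everything concrete below is an assumption made for illustration: the committee members are 1-NN classifiers with Euclidean distance trained on bootstrap resamples, disagreement is measured by binary vote entropy, density is the mean similarity 1 / (1 + distance) to the other unlabeled series, and information content is the product of the two. The sketch also assumes the labeled set already contains both classes (1 = positive, 0 = negative). All function names are hypothetical and are not taken from the paper.

```python
import math
import random


def euclid(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nn_predict(train, x):
    """1-NN prediction: label of the training series closest to x."""
    return min(train, key=lambda item: euclid(item[0], x))[1]


def build_committee(labeled, n_members, rng):
    """Assumed committee: each member is a bootstrap resample of the labeled set."""
    return [[rng.choice(labeled) for _ in labeled] for _ in range(n_members)]


def vote_entropy(committee, x):
    """Return (binary vote entropy, fraction of members voting positive)."""
    votes = [nn_predict(member, x) for member in committee]
    p = sum(votes) / len(votes)
    if p in (0.0, 1.0):
        return 0.0, p  # unanimous vote, no disagreement
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p)), p


def density(unlabeled, i):
    """Assumed density: mean similarity 1 / (1 + distance) to the other series."""
    others = [u for j, u in enumerate(unlabeled) if j != i]
    if not others:
        return 1.0
    return sum(1.0 / (1.0 + euclid(unlabeled[i], u)) for u in others) / len(others)


def select_and_autolabel(committee, unlabeled, max_entropy=0.0):
    """One iteration: choose an expert query and auto-label consistent samples."""
    scored, auto = [], []
    for i, x in enumerate(unlabeled):
        h, p = vote_entropy(committee, x)
        if h <= max_entropy:
            # High consistency: accept the committee's majority label.
            auto.append((i, 1 if p >= 0.5 else 0))
        else:
            # Assumed information content: disagreement weighted by density.
            scored.append((h * density(unlabeled, i), i))
    query = max(scored)[1] if scored else None
    return query, auto


if __name__ == "__main__":
    labeled = [([0.0], 0), ([0.4], 0), ([9.0], 1), ([10.0], 1)]
    pool = [[0.5], [3.0], [9.5], [3.2]]
    committee = build_committee(labeled, n_members=5, rng=random.Random(0))
    print(select_and_autolabel(committee, pool))
```

In a full loop, the index returned as `query` would go to the expert, the pairs in `auto` would be moved into the labeled set, and the committee would be rebuilt before the next iteration. Setting `max_entropy` to zero auto-labels only unanimous votes, a conservative choice that limits the boundary mislabeling the abstract attributes to purely semi-supervised labeling.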