Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (20): 167-171.

• Pattern Recognition and Artificial Intelligence •


Ensemble self-training method based on active learning and confidence voting

LI Junnan, LV Jia   

  1. College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
  • Online:2016-10-15 Published:2016-10-14


Abstract: The self-training algorithm based on ensemble learning is a semi-supervised algorithm. Many researchers select reliable samples by class voting or by the average confidence of the ensemble classifiers. Confidence-based voting strategies tend to select samples with high confidence, or samples with low confidence on which the ensemble nevertheless votes unanimously; the latter case may mislabel samples near the decision boundary. Moreover, when heterogeneous ensemble classifiers are used, the base classifiers may assign different labels to a high-confidence sample, so that it cannot be effectively added to the labeled set. An ensemble self-training algorithm combining active learning with a confidence voting strategy is proposed to solve these problems. The algorithm adjusts the voting strategy so that only unlabeled samples with high confidence and a unanimous vote are pseudo-labeled, while active learning is used to have samples with low confidence and inconsistent votes labeled manually. This compensates for the shortcoming that ensemble self-training focuses only on high-confidence samples and ignores the useful information carried by low-confidence ones. Comparative experiments on UCI datasets verify the effectiveness of the proposed algorithm.
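The selection rule described above can be sketched in code. The following is a hypothetical illustration, not the authors' implementation: the choice of base classifiers (naive Bayes plus a distance-weighted KNN, suggested by the keywords), the confidence threshold, and the function name `select_samples` are all assumptions.

```python
# Sketch of the voting strategy from the abstract: pseudo-label unlabeled
# samples on which the ensemble votes unanimously with high confidence,
# and route low-confidence, disagreeing samples to a human oracle
# (active learning). The threshold of 0.9 is an illustrative assumption.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def select_samples(classifiers, X_unlabeled, threshold=0.9):
    """Partition unlabeled samples per the voting strategy:
    - unanimous vote AND high mean confidence -> pseudo-label
    - disagreement AND low mean confidence   -> query the oracle
    Returns (pseudo_idx, pseudo_labels, query_idx)."""
    votes = np.stack([c.predict(X_unlabeled) for c in classifiers])   # (n_clf, n)
    confs = np.stack([c.predict_proba(X_unlabeled).max(axis=1)
                      for c in classifiers])                          # (n_clf, n)
    unanimous = np.all(votes == votes[0], axis=0)
    mean_conf = confs.mean(axis=0)
    pseudo = unanimous & (mean_conf >= threshold)   # safe to self-label
    query = ~unanimous & (mean_conf < threshold)    # ask the human oracle
    return np.where(pseudo)[0], votes[0][pseudo], np.where(query)[0]

# Toy run on two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X_l = np.r_[rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))]
y_l = np.r_[np.zeros(20, int), np.ones(20, int)]
X_u = np.r_[rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))]

ensemble = [GaussianNB().fit(X_l, y_l),
            KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_l, y_l)]
pseudo_idx, pseudo_labels, query_idx = select_samples(ensemble, X_u)
```

In a full self-training loop, the pseudo-labeled samples would be added to the labeled set, the oracle-labeled samples would be added after manual annotation, the ensemble would be retrained, and the process repeated until the unlabeled pool is exhausted or no sample qualifies.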

Key words: ensemble self-training, active learning, weighted K-Nearest Neighbor (KNN), naive Bayes, confidence