Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (20): 167-171.

• Pattern Recognition and Artificial Intelligence •


Ensemble self-training method based on active learning and confidence voting

LI Junnan, LV Jia   

  1. College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
  • Online:2016-10-15 Published:2016-10-14


Abstract: The self-training algorithm based on ensemble learning is a semi-supervised algorithm. Many researchers select reliable samples by class voting or by the average confidence of the ensemble classifiers. Confidence-based voting strategies tend to select samples with high confidence, or samples with low confidence on which the ensemble nevertheless votes unanimously; the latter case may mislabel samples near the decision boundary. Moreover, when heterogeneous ensemble classifiers are used, the base classifiers may assign different labels to a high-confidence sample, so that it cannot be effectively added to the labeled set. An ensemble self-training algorithm combining active learning with a confidence voting strategy is proposed to solve these problems. The algorithm adjusts the voting strategy so that only unlabeled samples with high confidence and a unanimous vote are pseudo-labeled, while active learning is used to have samples with low confidence and inconsistent votes labeled manually. This compensates for the shortcoming that ensemble self-training focuses only on high-confidence samples and ignores the useful information carried by low-confidence ones. Comparative experiments on UCI datasets verify the effectiveness of the proposed algorithm.
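The selection rule described above can be sketched in code. The following is a hypothetical illustration, not the authors' implementation: the choice of base classifiers (naive Bayes plus a distance-weighted KNN, suggested by the keywords), the confidence threshold, and the function name `select_samples` are all assumptions.

```python
# Sketch of the voting strategy from the abstract: pseudo-label unlabeled
# samples on which the ensemble votes unanimously with high confidence,
# and route low-confidence, disagreeing samples to a human oracle
# (active learning). The threshold of 0.9 is an illustrative assumption.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def select_samples(classifiers, X_unlabeled, threshold=0.9):
    """Partition unlabeled samples per the voting strategy:
    - unanimous vote AND high mean confidence -> pseudo-label
    - disagreement AND low mean confidence   -> query the oracle
    Returns (pseudo_idx, pseudo_labels, query_idx)."""
    votes = np.stack([c.predict(X_unlabeled) for c in classifiers])   # (n_clf, n)
    confs = np.stack([c.predict_proba(X_unlabeled).max(axis=1)
                      for c in classifiers])                          # (n_clf, n)
    unanimous = np.all(votes == votes[0], axis=0)
    mean_conf = confs.mean(axis=0)
    pseudo = unanimous & (mean_conf >= threshold)   # safe to self-label
    query = ~unanimous & (mean_conf < threshold)    # ask the human oracle
    return np.where(pseudo)[0], votes[0][pseudo], np.where(query)[0]

# Toy run on two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X_l = np.r_[rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))]
y_l = np.r_[np.zeros(20, int), np.ones(20, int)]
X_u = np.r_[rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))]

ensemble = [GaussianNB().fit(X_l, y_l),
            KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_l, y_l)]
pseudo_idx, pseudo_labels, query_idx = select_samples(ensemble, X_u)
```

In a full self-training loop, the pseudo-labeled samples would be added to the labeled set, the oracle-labeled samples would be added after manual annotation, the ensemble would be retrained, and the process repeated until the unlabeled pool is exhausted or no sample qualifies.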

Key words: ensemble self-training, active learning, weighted K-Nearest Neighbor (KNN), naive Bayes, confidence