Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (12): 112-117. DOI: 10.3778/j.issn.1002-8331.1903-0008

Research of Outlier Ensemble Mining Based on Active Learning

ZHAO Xiaoyong, WANG Ningning, WANG Lei   

  1. School of Information and Management, Beijing Information Science & Technology University, Beijing 100129, China
  • Online:2020-06-15 Published:2020-06-09

基于主动学习的离群点集成挖掘方法研究

赵晓永,王宁宁,王磊   

  1. 北京信息科技大学 信息管理学院,北京 100129

Abstract:

Outlier detection tasks usually lack labeled data, and outliers account for only a small fraction of the data set. Outlier detection is therefore harder than many other data mining tasks, and no single algorithm suits all scenarios. Combining diverse model ensembles with active learning, this paper proposes an outlier ensemble detection method named Outlier Mining based on Active Learning (OMAL). Within the active learning framework, and following a comparative analysis of candidate learners, five unsupervised models based on statistics, similarity, and axis-parallel subspaces are selected as base learners. The data that each base learner places on the boundary between outlier and normal are then integrated, filtered, and presented to human experts for labeling, so as to maximize the information gained from expert feedback. Samples are drawn from the expert-labeled data set and from the data set generated by voting among the base learners; a supervised binary classification model based on the Gradient Boosting Machine (GBM) is trained on these samples and applied to the full data set to obtain the final results. Experiments show that OMAL significantly improves AUC while running efficiently, which gives it good practical value.
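The query-selection and voting steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two toy base learners (a z-score and a k-nearest-neighbor distance on one-dimensional data), the rank normalization, the threshold of 0.8, the unanimous-vote rule, and all function names are assumptions made for illustration. The final step, training a GBM classifier on the expert labels plus the vote-based labels, is omitted here.

```python
from statistics import mean, pstdev


def zscore_scores(xs):
    """Statistical base learner (toy): absolute z-score of each value."""
    mu, sigma = mean(xs), pstdev(xs)
    return [abs(x - mu) / sigma if sigma else 0.0 for x in xs]


def knn_scores(xs, k=2):
    """Similarity base learner (toy): distance to the k-th nearest neighbor."""
    scores = []
    for i, x in enumerate(xs):
        dists = sorted(abs(x - y) for j, y in enumerate(xs) if j != i)
        scores.append(dists[k - 1])
    return scores


def rank_normalize(scores):
    """Map raw scores to [0, 1] by rank so different learners are comparable."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    for r, i in enumerate(order):
        ranks[i] = r / (n - 1) if n > 1 else 0.0
    return ranks


def boundary_queries(score_lists, threshold, budget):
    """Union of each learner's points nearest its outlier/normal threshold.

    These are the points the learners are least sure about, so they are
    the ones shown to the human expert (assumed selection rule).
    """
    picked = set()
    for scores in score_lists:
        nearest = sorted(range(len(scores)),
                         key=lambda i: abs(scores[i] - threshold))
        picked.update(nearest[:budget])
    return sorted(picked)


def vote_labels(score_lists, threshold):
    """Pseudo-labels from unanimous votes: 1 = outlier, 0 = normal, None = no consensus."""
    labels = []
    for point_scores in zip(*score_lists):
        votes = [s >= threshold for s in point_scores]
        if all(votes):
            labels.append(1)
        elif not any(votes):
            labels.append(0)
        else:
            labels.append(None)
    return labels


# Toy data: one obvious outlier (5.0) among values near 1.0.
data = [1.0, 1.1, 0.9, 1.2, 1.05, 0.95, 1.15, 5.0, 1.3, 0.8]
score_lists = [rank_normalize(zscore_scores(data)),
               rank_normalize(knn_scores(data))]

queries = boundary_queries(score_lists, threshold=0.8, budget=2)
pseudo = vote_labels(score_lists, threshold=0.8)

print("indices to show the expert:", queries)
print("vote-based pseudo-labels:", pseudo)
```

In a fuller pipeline, the expert's answers for `queries` and the non-`None` entries of `pseudo` would be sampled to form the training set for the supervised classifier. Asking only about boundary points keeps the labeling effort small while targeting the cases where the base learners are least decisive.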

Key words: outlier detection, active learning, model ensemble

摘要:

离群点检测任务通常缺少可用的标注数据,且离群数据只占整个数据集的很小一部分,相较于其他的数据挖掘任务,离群点检测的难度较大,尚没有单一的算法适合于所有的场景。因此,结合多样性模型集成和主动学习思想,提出了一种基于主动学习的离群点集成检测方法OMAL(Outlier Mining based on Active Learning)。在主动学习框架指导下,根据各种基学习器的对比分析,选择了基于统计的、基于相似性的、基于子空间划分的三个无监督模型作为基学习器。将各基学习器评判的处于离群和正常边界的数据整合后呈现给人类专家进行标注,以最大化人类专家反馈的信息量;从标注的数据集和各基学习器投票产生的数据集中抽样,基于GBM(Gradient Boosting Machine)训练一个有监督二元分类模型,并将该模型应用于全数据集,得出最终的挖掘结果。实验表明,提出方法的AUC有了较为明显的提升,且具有良好的运行效率,具备较好的实用价值。

关键词: 离群检测, 主动学习, 模型集成