Research of Outlier Ensemble Mining Based on Active Learning

doi:10.3778/j.issn.1002-8331.1903-0008

Abstract:

Outlier detection tasks usually lack labeled data, and outliers account for only a small part of the entire data set. Outlier detection is therefore harder than other data mining tasks, and no single algorithm suits every scenario. Combining diversity-based model ensembles with active learning, this paper proposes an outlier ensemble detection method named Outlier Mining based on Active Learning (OMAL). Under the guidance of the active learning framework, five unsupervised models based on statistics, similarity, and axis-parallel subspaces are selected as base learners on the basis of a comparative analysis of candidate learners. The data that each base learner places on the boundary between outlier and normal are then integrated, filtered, and presented to human experts for labeling, so as to maximize the information gained from expert feedback. Samples are drawn from the labeled data set and from the data set generated by the voting of the base learners, and a supervised binary classification model based on the Gradient Boosting Machine (GBM) is trained on them and applied to the full data set to obtain the final results. Experiments show that OMAL significantly improves AUC while offering good runtime efficiency and practical value.

Key words: outlier detection, active learning, model ensemble

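Abstract (translated from the Chinese):

Outlier detection tasks usually lack usable labeled data, and outliers make up only a very small part of the whole data set. Compared with other data mining tasks, outlier detection is harder, and no single algorithm suits every scenario. Therefore, combining diversity-based model ensembles with active learning, an outlier ensemble detection method named OMAL (Outlier Mining based on Active Learning) is proposed. Under the guidance of the active learning framework, three unsupervised models (statistics-based, similarity-based, and subspace-partition-based) are selected as base learners on the basis of a comparative analysis of various base learners. The data that each base learner judges to lie on the boundary between outlier and normal are merged and presented to human experts for labeling, so as to maximize the information in the experts' feedback. Samples are drawn from the labeled data set and from the data set produced by the votes of the base learners, a supervised binary classification model is trained with GBM (Gradient Boosting Machine), and the model is applied to the full data set to obtain the final mining results. Experiments show that the proposed method achieves a fairly clear improvement in AUC, runs efficiently, and has good practical value.

Key words: outlier detection, active learning, model ensemble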
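The labeling step outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three score lists stand in for hypothetical base detectors whose scores are assumed to be already normalized to [0, 1], the 0.5 threshold and the query budget are arbitrary, and the names `select_queries`, `vote_labels`, and `build_training_labels` are invented for illustration. For brevity the sketch merges expert labels with majority-vote pseudo-labels instead of sampling from the two sets, and it stops before the GBM classifier is trained.

```python
def select_queries(score_lists, threshold=0.5, budget=2):
    """Return indices of the points nearest any detector's decision boundary."""
    n = len(score_lists[0])
    # Margin of a point = its smallest distance to the threshold over all detectors.
    margin = [min(abs(scores[i] - threshold) for scores in score_lists)
              for i in range(n)]
    return sorted(range(n), key=lambda i: (margin[i], i))[:budget]


def vote_labels(score_lists, threshold=0.5):
    """Majority vote of the detectors: 1 = outlier, 0 = normal."""
    n = len(score_lists[0])
    labels = []
    for i in range(n):
        votes = sum(scores[i] > threshold for scores in score_lists)
        labels.append(1 if 2 * votes > len(score_lists) else 0)
    return labels


def build_training_labels(score_lists, oracle, threshold=0.5, budget=2):
    """Merge expert labels for queried points with vote-based pseudo-labels."""
    labels = vote_labels(score_lists, threshold)
    queried = select_queries(score_lists, threshold, budget)
    for i in queried:
        labels[i] = oracle(i)  # expert feedback overrides the vote
    return labels, queried


# Toy scores from three hypothetical base detectors for six points.
scores = [
    [0.10, 0.20, 0.55, 0.90, 0.45, 0.05],
    [0.15, 0.10, 0.40, 0.95, 0.60, 0.10],
    [0.05, 0.30, 0.52, 0.85, 0.35, 0.20],
]
truth = [0, 0, 0, 1, 1, 0]  # stands in for the human expert (assumed labels)
labels, queried = build_training_labels(scores, oracle=lambda i: truth[i])
print(queried, labels)
```

In this toy run the two points nearest the boundary are the ones sent to the expert, and the expert's answers replace the vote for exactly those points. The resulting labels would then serve as training targets for the supervised classifier.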

ZHAO Xiaoyong (赵晓永), WANG Ningning (王宁宁), WANG Lei (王磊). Research of Outlier Ensemble Mining Based on Active Learning (基于主动学习的离群点集成挖掘方法研究)[J]. Computer Engineering and Applications (计算机工程与应用), 2020, 56(12): 112-117.
