Study on Generative Adversarial Network for Data Anomaly Detection

doi:10.3778/j.issn.1002-8331.2008-0354

Abstract

Abstract: Many detection models perform poorly because of class-imbalanced data and the complexity of anomalous data. This paper proposes a data anomaly detection model based on the generative adversarial network (GAN). The model first uses an InfoGAN network to generate class-balanced samples of normal and anomalous data, then builds an inference network that acts as a label generator for the generated and original samples. The inference network is fine-tuned with a second GAN, which guarantees consistency between the generated samples and their labels. Finally, the generated sample-label pairs are classified with a random forest whose best hyperparameters are searched by the Hyperband algorithm, which further optimizes the inference network. The model is compared with five traditional machine learning models on four real datasets. The results show that it detects anomalous data effectively once the classes are balanced, without collecting more anomaly samples. On the MNIST dataset, its AUC is 0.14 higher than that of the K-nearest neighbor (KNN) model, and its overall performance exceeds that of the traditional machine learning models.
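The Hyperband search mentioned in the abstract can be illustrated with a minimal sketch of the bracket and successive-halving schedule from Li et al. It is not the authors' implementation. The names `sample_rf_config` and `toy_validation_score`, the hyperparameter ranges, and the choice of "number of trees" as the resource are all assumptions made for illustration; the toy score stands in for a real validation AUC of a random forest.

```python
import math
import random


def hyperband(sample_config, evaluate, max_resource=81, eta=3, seed=0):
    """Minimal Hyperband: returns (best_score, best_config); higher score is better.

    sample_config(rng) -> config; evaluate(config, resource) -> float.
    """
    rng = random.Random(seed)
    # s_max is the largest s with eta**s <= max_resource.
    s_max = 0
    while eta ** (s_max + 1) <= max_resource:
        s_max += 1
    best = (-math.inf, None)
    for s in range(s_max, -1, -1):
        # Bracket s starts with n configurations at max_resource / eta**s each.
        n = math.ceil((s_max + 1) * eta ** s / (s + 1))
        r = max_resource / eta ** s
        configs = [sample_config(rng) for _ in range(n)]
        # Successive halving: score everything, keep the top 1/eta, raise the resource.
        for i in range(s + 1):
            resource = max(1, round(r * eta ** i))
            scored = [(evaluate(c, resource), c) for c in configs]
            scored.sort(key=lambda t: t[0], reverse=True)
            # Simplification: track the best score seen at any resource level.
            if scored[0][0] > best[0]:
                best = scored[0]
            keep = max(1, len(configs) // eta)
            configs = [c for _, c in scored[:keep]]
    return best


def sample_rf_config(rng):
    # Hypothetical random forest search space (ranges are assumptions).
    return {"max_depth": rng.randint(1, 20), "max_features": rng.uniform(0.1, 1.0)}


def toy_validation_score(config, n_trees):
    # Stand-in for a validation AUC; a real run would train a forest with n_trees trees.
    depth_penalty = ((config["max_depth"] - 8) / 20) ** 2
    feature_penalty = (config["max_features"] - 0.5) ** 2
    return 1.0 - depth_penalty - feature_penalty - 0.5 / n_trees


best_score, best_config = hyperband(sample_rf_config, toy_validation_score)
```

With `max_resource=81` and `eta=3`, the sketch runs five brackets that start with 81, 34, 15, 8, and 5 configurations. In practice, `toy_validation_score` would be replaced by a function that trains a random forest on the generated sample-label pairs and returns its validation score.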

Key words: data anomaly detection, InfoGAN, random forest, Hyperband
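The abstract reports results as AUC values. As a hedged illustration of what that metric measures, the sketch below computes AUC as the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one, with ties counted as one half. The function name `roc_auc` and the example labels and scores are made up for illustration and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of anomaly (label 1) and normal (label 0) scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    # Each (anomaly, normal) pair contributes 1 if ranked correctly, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative toy data: two normal and two anomalous samples.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because AUC depends only on how the two classes are ranked against each other, it is less sensitive to class imbalance than plain accuracy. The pairwise form is quadratic in the number of samples; for larger datasets a library routine such as scikit-learn's `roc_auc_score` would be the usual choice.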

ZHUANG Yuesheng, LIN Shanling, LIN Zhixian, ZHANG Yongai, GUO Tailiang. Study on Generative Adversarial Network for Data Anomaly Detection[J]. Computer Engineering and Applications, 2022, 58(4): 143-149.

