Self-Supervised Tabular Data Anomaly Detection Method Based on Knowledge Enhancement

doi:10.3778/j.issn.1002-8331.2301-0087

Abstract

Traditional supervised anomaly detection methods have developed rapidly. To reduce the dependence on labels, self-supervised pre-training methods have been widely studied, and this work shows that embedding additional intrinsic semantic knowledge is crucial for table learning. To mine the rich knowledge contained in tabular data, a self-supervised tabular data anomaly detection method based on knowledge enhancement (STKE) is proposed, with the following improvements. First, the proposed data processing module integrates domain knowledge (semantics) and statistical and mathematical knowledge into feature construction, while self-supervised pre-training (parameter learning) provides contextual knowledge priors, so that the rich information in tabular data can be transferred. Second, a mask mechanism is applied to the original data: the model learns the masked features from the related non-masked features and, at the same time, predicts the original value of the additive Gaussian noise injected in the hidden-layer space of the data. This strategy pushes the model to recover the original feature information even when its inputs are noisy. Third, a hybrid attention mechanism is used to effectively extract the association information between data features. Experimental results on six datasets show that the proposed method achieves superior performance.

Keywords: anomaly detection, self-supervised, knowledge enhancement, pre-training
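
The two pretext tasks named in the abstract (recovering masked features from visible ones, and predicting additive Gaussian noise injected in the hidden-layer space) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the zero fill value for masked cells, the `mask_prob` and `noise_std` defaults, and the equal weighting of the two loss terms are all assumptions made for illustration, and the encoder that would produce the hidden representation is left out.

```python
import random

def mask_rows(rows, mask_prob=0.3, seed=0):
    """Hide a random subset of cells (assumed fill value: 0.0).

    Returns the masked table and a boolean mask marking hidden cells.
    """
    rng = random.Random(seed)
    mask = [[rng.random() < mask_prob for _ in row] for row in rows]
    masked = [
        [0.0 if m else v for v, m in zip(row, mrow)]
        for row, mrow in zip(rows, mask)
    ]
    return masked, mask

def add_latent_noise(hidden, noise_std=0.1, seed=0):
    """Add zero-mean Gaussian noise to a hidden representation.

    Returns the noisy representation and the noise itself, which serves
    as the regression target for the noise-prediction task.
    """
    rng = random.Random(seed)
    noise = [[rng.gauss(0.0, noise_std) for _ in row] for row in hidden]
    noisy = [
        [h + e for h, e in zip(hrow, erow)]
        for hrow, erow in zip(hidden, noise)
    ]
    return noisy, noise

def pretext_loss(rows, rows_hat, mask, noise, noise_hat):
    """Sum of two mean squared errors (equal weighting is an assumption).

    The reconstruction term is computed on masked cells only; the noise
    term is computed on every hidden unit.
    """
    sq_masked = [
        (p - t) ** 2
        for trow, prow, mrow in zip(rows, rows_hat, mask)
        for t, p, m in zip(trow, prow, mrow)
        if m
    ]
    recon = sum(sq_masked) / max(len(sq_masked), 1)
    sq_noise = [
        (p - t) ** 2
        for trow, prow in zip(noise, noise_hat)
        for t, p in zip(trow, prow)
    ]
    noise_mse = sum(sq_noise) / max(len(sq_noise), 1)
    return recon + noise_mse
```

In a full model, `rows_hat` and `noise_hat` would come from decoder heads on top of the encoder; here they are plain inputs so that the loss construction can be read in isolation.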

GAO Xiaoyu, ZHAO Xiaoyong, WANG Lei. Self-Supervised Tabular Data Anomaly Detection Method Based on Knowledge Enhancement[J]. Computer Engineering and Applications, 2024, 60(10): 140-147.
