Research on Malware Classification Method Based on Heterogeneous Instruction Graph

doi:10.3778/j.issn.1002-8331.2305-0484

Abstract

Abstract: Malware is one of the biggest security threats on the Internet today. Existing malware classification research based on graph deep learning does not consider the inherent similarity implied by the control flow information of malware from the same family. To address this problem, a malware classification method based on the heterogeneous instruction graph (HIG), named MCHIG, is proposed. It comprises three stages: HIG generation, node embedding, and malware classification. First, the MyHIG dataset is generated. Then GraphSAGE is applied with message passing performed separately over each edge type, which completes file node classification and instruction node embedding on the HIG. Finally, malware classification is performed on control flow graphs. On the BIG2015 dataset, the classification accuracy in the embedding stage reaches 97.81%. In the classification stage, both five-fold and ten-fold cross-validation were carried out; ten-fold cross-validation performed better, with an accuracy of 99.91%. On the BODMAS_mini few-sample dataset, the accuracy reaches 96.53% in the embedding stage and 98.76% in the classification stage. These results are better than those of other current advanced malware classification models.

Key words: malware classification, heterogeneous instruction graph, graph deep learning, control flow graph
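
The second stage described in the abstract, GraphSAGE message passing carried out separately for each edge type, can be illustrated with a minimal sketch. The abstract does not give the node features, edge types, or update rule, so everything below is an assumption made for illustration: the node names (`file_A`, `mov`, `push`), the edge types (`contains`, `follows`, `contained_in`), the weight matrices, and the choice to sum the per-type mean aggregates with a self term (which is equivalent to concatenating them and applying one linear layer). It is not the authors' implementation.

```python
from collections import defaultdict


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def hetero_sage_layer(features, edges_by_type, W_self, W_by_type):
    """One GraphSAGE-style layer with mean aggregation per edge type.

    features:      dict node -> feature vector
    edges_by_type: dict edge type -> list of (src, dst); messages flow src -> dst
    W_self:        weight matrix applied to a node's own features
    W_by_type:     dict edge type -> weight matrix for that type's aggregate
    """
    # Group incoming neighbour features separately for every edge type.
    incoming = {t: defaultdict(list) for t in edges_by_type}
    for t, pairs in edges_by_type.items():
        for src, dst in pairs:
            incoming[t][dst].append(features[src])

    updated = {}
    for node, h in features.items():
        out = matvec(W_self, h)
        for t in edges_by_type:
            msgs = incoming[t].get(node)
            if msgs:  # skip edge types with no incoming neighbours
                agg = matvec(W_by_type[t], mean_vectors(msgs))
                out = [a + b for a, b in zip(out, agg)]
        updated[node] = [max(0.0, v) for v in out]  # ReLU
    return updated


# Hypothetical toy graph: one file node and two instruction nodes.
features = {"file_A": [1.0, 0.0], "mov": [0.0, 1.0], "push": [1.0, 1.0]}
edges_by_type = {
    "contains": [("file_A", "mov"), ("file_A", "push")],      # file -> instruction
    "follows": [("mov", "push")],                             # instruction -> instruction
    "contained_in": [("mov", "file_A"), ("push", "file_A")],  # instruction -> file
}
W_self = [[1, 0], [0, 1]]
W_by_type = {
    "contains": [[1, 0], [0, 1]],
    "follows": [[2, 0], [0, 2]],
    "contained_in": [[0.5, 0], [0, 0.5]],
}

updated = hetero_sage_layer(features, edges_by_type, W_self, W_by_type)
for node, vec in updated.items():
    print(node, vec)
```

Keeping a separate weight matrix per edge type lets the layer treat, for example, file-to-instruction links differently from instruction-to-instruction links. In a trained model these matrices would be learned rather than fixed by hand as they are here.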

QIAN Liping, JI Xiaomei. Research on Malware Classification Method Based on Heterogeneous Instruction Graph[J]. Computer Engineering and Applications, 2024, 60(3): 299-308.
