Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (3): 299-308. DOI: 10.3778/j.issn.1002-8331.2305-0484

• Network, Communication and Security •

Research on Malware Classification Method Based on Heterogeneous Instruction Graph

QIAN Liping, JI Xiaomei

  1. School of Electrical and Information Engineering, Beijing Jianzhu University, Beijing 100044, China
  2. Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing Jianzhu University, Beijing 100044, China
  • Online: 2024-02-01 Published: 2024-02-01

Abstract: Malware is one of the biggest security threats on the Internet today. Existing malware classification research based on graph deep learning does not account for the inherent similarity implied by the control flow information of malware from the same family. To address this problem, a malware classification method based on the heterogeneous instruction graph (HIG), named MCHIG, is proposed. It comprises three stages: HIG generation, node embedding, and malware classification. First, the MyHIG dataset is generated. Then GraphSAGE is applied to perform message passing separately over each type of edge, completing file-node classification and instruction-node embedding on the HIG. Finally, the malware classification task is completed based on the control flow graph. On the BIG2015 dataset, the embedding stage achieves a classification accuracy of 97.81%; in the classification stage, five-fold and ten-fold cross-validation are carried out, with ten-fold cross-validation performing better and reaching an accuracy of 99.91%. On the BODMAS_mini few-sample dataset, the method reaches 96.53% in the embedding stage and 98.76% in the classification stage, which is better than other advanced malware classification models.

Key words: malware classification, heterogeneous instruction graph, graph deep learning, control flow graph
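
The per-edge-type message passing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a plain-Python, mean-aggregator GraphSAGE-style layer under simplifying assumptions (messages are summed across edge types, weights are fixed rather than learned). The function name `hetero_sage_layer`, the node names, and the edge types `file_contains` and `control_flow` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def mean_vectors(vectors, dim):
    """Element-wise mean of a list of vectors; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def hetero_sage_layer(features, edges_by_type, w_self, w_by_type):
    """One mean-aggregator layer, with aggregation run separately per edge type.

    features:      {node: [float, ...]}
    edges_by_type: {edge_type: [(src, dst), ...]}, messages flow src -> dst
    w_self, w_by_type[edge_type]: weight matrices given as lists of rows
    """
    dim = len(next(iter(features.values())))

    # Group incoming neighbour features by (edge type, destination node).
    incoming = {etype: defaultdict(list) for etype in edges_by_type}
    for etype, edges in edges_by_type.items():
        for src, dst in edges:
            incoming[etype][dst].append(features[src])

    updated = {}
    for node, h in features.items():
        out = matvec(w_self, h)  # self contribution
        for etype in edges_by_type:
            neigh_mean = mean_vectors(incoming[etype].get(node, []), dim)
            msg = matvec(w_by_type[etype], neigh_mean)
            out = [a + b for a, b in zip(out, msg)]  # sum across edge types
        updated[node] = [max(0.0, x) for x in out]  # ReLU
    return updated


# Hypothetical toy graph: one file node and two instruction nodes.
identity = [[1, 0], [0, 1]]
features = {"file": [1.0, 0.0], "mov": [0.0, 2.0], "jmp": [2.0, 2.0]}
edges_by_type = {
    "file_contains": [("mov", "file"), ("jmp", "file")],  # assumed edge type
    "control_flow": [("mov", "jmp")],                     # assumed edge type
}
weights = {"file_contains": identity, "control_flow": identity}

embeddings = hetero_sage_layer(features, edges_by_type, identity, weights)
print(embeddings)
```

In a real pipeline the weight matrices would be trained, and the resulting file-node embeddings would feed a classifier evaluated with k-fold cross-validation as reported above.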