Computer Engineering and Applications, 2021, Vol. 57, Issue (22): 131-138. DOI: 10.3778/j.issn.1002-8331.2007-0291

• Network, Communication and Security •

Malware Family Classification Based on Deep Learning Visualization

CHEN Xiaohan, WEI Shuning, QIN Zhengze   

  1. College of Information Science and Engineering, Hunan Normal University, Changsha 410006, China
  2. National Key Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha 410006, China
  • Online: 2021-11-15 Published: 2021-11-16

Abstract:

The rapid development of computer network technology has led to a steady increase in the amount of malware. To address the malware family classification problem, this paper proposes a classification method based on deep learning visualization. The method generates feature images from malware opcodes, converting the opcodes into grayscale images that can be inspected directly. A recurrent neural network (RNN) processes the opcode sequences, so the features capture not only the original information in the malware but also the association between the original code and its temporal characteristics, which increases the information density of the classification features. SimHash then fuses the original encoding with the encoding predicted by the RNN to generate the feature images. Because images of malicious code from the same family are visibly more similar to one another than to images from other families, and because traditional classification models cannot extract classification features automatically, a convolutional neural network (CNN) is used to classify the feature images. The method was implemented and evaluated on 10,868 malware samples from 9 families; it achieves a classification accuracy of 98.8% and yields effective, information-enhanced classification features.

Key words: malware family, malicious code visualization, Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), SimHash
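
The abstract does not give the details of the SimHash fusion step, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that opcodes are grouped into bigrams, that each bigram is hashed with truncated SHA-256, that "fusion" means letting the original and the predicted opcode sequences vote in the same SimHash accumulator, and that the 64-bit fingerprint is laid out as an 8x8 image with pixel values 0 and 255. All function names, the example opcode lists, and the image size are hypothetical.

```python
import hashlib
from collections import Counter


def ngrams(tokens, n=2):
    """Join consecutive opcodes into n-grams so local order is preserved."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def token_hash(token, bits=64):
    """Stable hash of a token, truncated to `bits` bits (illustrative choice)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(tokens, bits=64):
    """Standard SimHash: each token votes +weight or -weight on every bit."""
    votes = [0] * bits
    for token, weight in Counter(tokens).items():
        h = token_hash(token, bits)
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    # A bit is set in the fingerprint when its vote total is positive.
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def to_gray_image(fingerprint, side=8):
    """Lay the fingerprint bits out row by row as 0/255 grayscale pixels."""
    return [[255 if (fingerprint >> (r * side + c)) & 1 else 0
             for c in range(side)]
            for r in range(side)]


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Hypothetical inputs: an original opcode sequence and a stand-in for the
# sequence an RNN might predict from it (no model is run here).
original = ["push", "mov", "call", "add", "mov", "ret"]
predicted = ["push", "mov", "call", "sub", "mov", "ret"]

# Assumed fusion rule: both sequences contribute to one fingerprint.
fused = simhash(ngrams(original) + ngrams(predicted))
image = to_gray_image(fused)
```

Because SimHash is locality-sensitive, samples whose opcode n-gram distributions are similar tend to produce fingerprints with a small Hamming distance, which is the property that would make images from the same family look alike. A practical system would presumably produce much larger images (for example, from many fingerprints per sample) before passing them to a CNN.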