Malware Family Classification Based on Deep Learning Visualization

doi:10.3778/j.issn.1002-8331.2007-0291

Abstract

The rapid development of computer network technology has led to a steadily growing volume of malware. To address the problem of malware family classification, this paper proposes a classification method based on deep learning visualization. The method converts malware opcodes into grayscale images that can be inspected directly. A recurrent neural network (RNN) processes the opcode sequences, so that the features capture not only the original information in the malware but also the temporal relationships within the original code, which increases the information density of the classification features. SimHash then fuses the original codes with the codes predicted by the RNN to generate feature images. Images of malicious code from the same family are visibly more similar to one another than to images from other families, and traditional classification models cannot extract classification features automatically; for both reasons, a convolutional neural network (CNN) is used to classify the feature images. The method was implemented and tested on 10,868 malware samples from 9 families. It reaches a classification accuracy of 98.8% and yields effective, information-enhanced classification features.

Keywords: malware family, malicious code visualization, recurrent neural network (RNN), convolutional neural network (CNN), SimHash
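
The SimHash fusion step named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the hash function, the fingerprint length, or the image layout, so the MD5 token hash, the 8-bit fingerprint per pixel, and the chunk-per-pixel layout are all assumptions made for illustration. The function names `simhash`, `hamming_distance`, and `fused_image` are hypothetical, and the `predicted` list simply stands in for the output of a trained RNN.

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """Classic SimHash: every token votes +weight or -weight on each bit.

    bits must not exceed 128, the width of the MD5 digest assumed here.
    """
    votes = [0] * bits
    for token, weight in Counter(tokens).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest, "big")
        for i in range(bits):
            votes[i] += weight if (value >> i) & 1 else -weight
    # A bit is set in the fingerprint when its vote total is positive.
    return sum(1 << i for i, vote in enumerate(votes) if vote > 0)


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def fused_image(original, predicted, side=4):
    """Assumed layout: pixel k is the 8-bit SimHash of the k-th chunk of
    original opcodes combined with the k-th chunk of predicted opcodes."""
    n_pixels = side * side

    def chunk(seq, k):
        step = max(1, -(-len(seq) // n_pixels))  # ceiling division
        return seq[k * step:(k + 1) * step]

    pixels = [
        simhash(chunk(original, k) + chunk(predicted, k), bits=8)
        for k in range(n_pixels)
    ]
    # Reshape the flat pixel list into side x side rows of gray values (0-255).
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


# Toy opcode sequences; `predicted` is a placeholder for RNN output.
original = ["push", "mov", "call", "add", "mov", "ret"] * 4
predicted = ["mov", "call", "add", "mov", "ret", "push"] * 4
image = fused_image(original, predicted, side=4)
```

Each entry of `image` is a gray value between 0 and 255, so the nested list can be passed to any image library or fed to a CNN as a single-channel input. Chunks that fall past the end of a short sequence hash to 0 in this sketch, which a real pipeline would more likely handle by padding or resizing.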

CHEN Xiaohan, WEI Shuning, QIN Zhengze. Malware Family Classification Based on Deep Learning Visualization[J]. Computer Engineering and Applications, 2021, 57(22): 131-138.
