Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (15): 147-155. DOI: 10.3778/j.issn.1002-8331.2005-0149

• Network, Communication and Security •

Malware Classification Method Based on Fusion of Static Features

YANG Chunyu, XU Yang, ZHANG Sicong, LI Xiaojian   

  1. Key Laboratory of Information and Computer Science of Guizhou Province, Guizhou Normal University, Guiyang 550001, China
  • Online: 2021-08-01 Published: 2021-07-26

Abstract:

Existing malware classification methods fuse static features that are high-dimensional and slow to extract, and Boosting algorithms train serially, which takes a long time on large numbers of high-dimensional samples. To address these problems, a classification method based on the fusion of static features is proposed. Three kinds of features are extracted: grayscale-image pixel features of the original file and of its decompiled Lst file, structural features of the original file, and content features of the Lst file. These features are then fused and used for classification. LightGBM serves as the classifier: the GOSS (Gradient-based One-Side Sampling) algorithm is enabled when sampling the training set to reduce the number of training samples used, and EFB (Exclusive Feature Bundling) reduces dimensionality by bundling mutually exclusive features. Experiments show that classification accuracy reaches 97.04% when the three types of features are fused, and that enabling GOSS sampling reduces training time by 29%. In terms of classification performance, the fused features outperform feature sets that fuse Opcode n-grams, and LightGBM outperforms traditional deep learning and machine learning algorithms.

Key words: malware, static features, grayscale image, structural features, Lst file, LightGBM
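
Two steps named in the abstract can be illustrated with a minimal Python sketch: laying a file's raw bytes out as a grayscale pixel grid, and GOSS-style sampling of training instances by gradient magnitude. This is not the authors' implementation. The function names, the row width of 256, the zero padding, and the sampling rates are all assumptions chosen for illustration, and in practice LightGBM performs GOSS and EFB internally.

```python
import random


def bytes_to_gray_rows(data, width=256):
    """Lay raw bytes out as rows of 0-255 pixel intensities.

    The row width and the zero padding of the last row are assumptions
    made for this sketch, not values taken from the paper.
    """
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row.extend([0] * (width - len(row)))  # pad the final short row
        rows.append(row)
    return rows


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Gradient-based One-Side Sampling over per-sample gradients.

    Keeps the top_rate fraction with the largest |gradient|, randomly
    draws other_rate of the full set from the remainder, and up-weights
    the drawn samples by (1 - top_rate) / other_rate.
    Returns a list of (sample_index, weight) pairs.
    """
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    drawn = rng.sample(rest, min(n_other, len(rest)))
    weight = (1.0 - top_rate) / other_rate
    return [(i, 1.0) for i in top] + [(i, weight) for i in drawn]
```

With the default rates, each boosting round in this sketch would see 30% of the training instances, which is the kind of reduction that shortens training time; the 29% figure reported in the abstract comes from the authors' own experiments, not from this code.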