Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (11): 90-97. DOI: 10.3778/j.issn.1002-8331.1911-0132

• Network, Communication and Security •

Malicious Webpage Detection Method for Webpage Content Link Hierarchy Semantic Tree

CHEN Bengang (陈本刚), SONG Lipeng (宋礼鹏)

  1. Research Institute of Big Data and Network Security, School of Big Data, North University of China, Taiyuan 030051, China
  • Online: 2020-06-01 Published: 2020-06-01

Abstract:

Attackers use URL shortening services to defeat malicious webpage detectors that rely only on URL features, and malicious webpage detection also suffers from a severe imbalance between malicious and benign samples. To address both problems, this paper proposes a cost-sensitive learning method for malicious webpage detection that incorporates features from a hierarchical semantic tree of webpage content. The method builds a link hierarchy semantic tree from webpage content and extracts features from that tree, which avoids the feature invalidation caused by URL shortening services. It then trains a cost-sensitive detection model to handle the class imbalance. Experimental results show that, compared with existing methods, the proposed method not only copes with URL shortening services but also achieves a low false negative rate of 2.1% and a low false positive rate of 3.3% on the class-imbalanced detection task. In addition, on an unlabeled dataset of 250,000 records, the method improves recall by 38.2% compared with the anti-virus tool VirusTotal.

Key words: malicious webpage detection, URL shortening service, link hierarchy semantic tree, cost-sensitive
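
The abstract does not say how the cost-sensitive model is built or which costs are used. As a rough illustration only, the sketch below shows one standard cost-sensitive decision rule (a cost-derived probability threshold) together with the false negative rate and false positive rate quoted above. The function names, the cost values, and the toy data are all assumptions made for this example, not details from the paper.

```python
from typing import List, Sequence, Tuple


def cost_sensitive_threshold(cost_fn: float, cost_fp: float) -> float:
    """Probability threshold that minimises expected misclassification cost.

    Flag a page as malicious when p * cost_fn > (1 - p) * cost_fp,
    which rearranges to p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def predict(probs: Sequence[float], threshold: float) -> List[int]:
    """Label 1 (malicious) when the predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probs]


def error_rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (false negative rate, false positive rate) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fnr, fpr


# Made-up, imbalanced toy data: 2 malicious pages (label 1) among 8.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
# Hypothetical classifier scores for the "malicious" class.
probs = [0.9, 0.3, 0.4, 0.1, 0.15, 0.05, 0.15, 0.1]

# Cost-blind baseline: the usual 0.5 threshold.
default_pred = predict(probs, 0.5)
# Assumed costs: missing a malicious page is 4x as costly as a false alarm.
weighted_pred = predict(probs, cost_sensitive_threshold(cost_fn=4.0, cost_fp=1.0))

print("default  (FNR, FPR):", error_rates(y_true, default_pred))
print("weighted (FNR, FPR):", error_rates(y_true, weighted_pred))
```

On this toy data, weighting missed detections more heavily lowers the threshold, which trades a higher false positive rate for a lower false negative rate. That trade-off is the general motivation for cost-sensitive learning on imbalanced data.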