Computer Engineering and Applications ›› 2019, Vol. 55 ›› Issue (24): 84-90. DOI: 10.3778/j.issn.1002-8331.1809-0259

• Network, Communication and Security •

Phishing Detection Algorithm Based on Language Features of URL

WANG Yuqi, LIU Bowen, LIN Guoyuan

  1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China
  2. Mine Digitization Engineering Research Center of the Ministry of Education, Xuzhou, Jiangsu 221116, China
  3. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Online: 2019-12-15 Published: 2019-12-11

Abstract: To counter the detection-evasion strategies used by phishing sites, a phishing detection algorithm based on the language features of URLs is proposed. The concepts of motif and sensitivity are defined to describe these language features, based on an analysis of how the URLs of phishing sites and legitimate sites differ across detection domains. The main-level domain is first checked for similarity based on motifs. When the similarity is below a preset threshold, effective subdomain features are selected, and a random forest is used to learn and detect the language features of the subdomains. Experimental results show that the algorithm achieves an accuracy of 95.6% with a relatively short running time; the average recognition time is less than 1 s.

Key words: phishing site, Uniform Resource Locator (URL), language feature, motif, sensitivity
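
The two-stage flow described in the abstract can be sketched roughly as follows. This is only an illustration under several assumptions, not the authors' implementation. The abstract does not give the definitions of motif or sensitivity, so character bigrams with Jaccard similarity stand in for the motif-based similarity check. The `PROTECTED` list, the threshold of 0.5, the feature set, and the rule that a near-match of a protected name is flagged directly are all hypothetical. The second stage takes any callable, and `toy_model` is a placeholder where a trained random forest would go.

```python
from urllib.parse import urlsplit

# Hypothetical protected brand names; not taken from the paper.
PROTECTED = ("paypal", "apple", "taobao")


def ngrams(s, n=2):
    """Character n-grams, used here as a stand-in for the paper's motifs."""
    return {s[i:i + n] for i in range(len(s) - n + 1)} or {s}


def similarity(a, b):
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb)


def split_host(url):
    """Return (subdomain labels, main-level label).

    Naive assumption: the top-level domain is a single label.
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) < 2:
        return [], host
    return labels[:-2], labels[-2]


def subdomain_features(labels):
    """Illustrative features only; the paper's selected features are not listed here."""
    joined = ".".join(labels)
    return [
        len(labels),                                # number of subdomain labels
        len(joined),                                # total subdomain length
        sum(c.isdigit() for c in joined),           # digit count
        joined.count("-"),                          # hyphen count
        int(any(b in joined for b in PROTECTED)),   # brand name embedded in subdomain
    ]


def toy_model(features):
    """Placeholder for a trained random forest: flags an embedded brand name."""
    return features[4] == 1


def classify(url, subdomain_model, threshold=0.5):
    subs, main = split_host(url)
    if main in PROTECTED:
        return "legitimate"
    # Stage 1: similarity of the main-level domain to protected names.
    best = max(similarity(main, b) for b in PROTECTED)
    if best >= threshold:
        # Assumption: a close imitation of a protected name is flagged directly.
        return "phishing"
    # Stage 2: similarity is below the threshold, so inspect the subdomains.
    return "phishing" if subdomain_model(subdomain_features(subs)) else "legitimate"
```

Calling `classify(url, toy_model)` returns either `"phishing"` or `"legitimate"`. Keeping the second stage as a plain callable means a classifier trained elsewhere can be wrapped in without changing the control flow.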