Phishing Detection Algorithm Based on Language Features of URL

Computer Engineering and Applications, 2019, Vol. 55, Issue (24): 84-90. DOI: 10.3778/j.issn.1002-8331.1809-0259

WANG Yuqi, LIU Bowen, LIN Guoyuan

1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China
2. Mine Digitization Engineering Research Center of the Ministry of Education, Xuzhou, Jiangsu 221116, China
3. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China

Online: 2019-12-15 Published: 2019-12-11

Abstract: To counter the detection-evasion strategies of phishing sites, a phishing detection algorithm based on the language features of URLs is proposed. By analyzing how the URLs of phishing sites and legitimate sites differ across detection domains, motifs and sensitivity are defined to describe these language features. The main-level domain is first checked for similarity based on motifs. When the similarity is below a preset threshold, effective subdomain features are selected, and the random forest algorithm is used to learn and classify the language features of the subdomains. Experimental results show that the algorithm reaches an accuracy of 95.6% with a relatively short running time: the average recognition time is less than 1 s.

Key words: phishing site, Uniform Resource Locator (URL), language feature, motif, sensitivity
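The two-stage flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define motifs, the threshold, or the subdomain features, so character bigrams stand in for motifs, Jaccard overlap stands in for the similarity measure, the value 0.5 is an arbitrary threshold, and the protected-domain list and `toy_subdomain_rule` are hypothetical. How a URL is labelled when the similarity is at or above the threshold is also an assumption here. In the paper, the second stage is a random forest trained on subdomain features; the sketch accepts any callable in its place.

```python
from typing import Callable, Sequence
from urllib.parse import urlsplit

# Hypothetical list of protected main-level domains (not taken from the paper).
PROTECTED_DOMAINS = ("paypal.com", "taobao.com")


def split_host(url: str) -> tuple[str, str]:
    """Return (subdomain part, main-level domain) of a URL's host.

    Assumption: the main-level domain is the last two labels, which
    ignores multi-label public suffixes such as "co.uk".
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[:-2]), ".".join(labels[-2:])


def bigrams(text: str) -> set[str]:
    """Character bigrams, used here as a stand-in for the paper's motifs."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two bigram sets, in the range [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def toy_subdomain_rule(subdomain: str) -> bool:
    """Placeholder for the trained classifier: flags a subdomain that
    embeds a protected brand name. A random forest would go here."""
    return any(d.split(".")[0] in subdomain for d in PROTECTED_DOMAINS)


def detect(
    url: str,
    subdomain_classifier: Callable[[str], bool] = toy_subdomain_rule,
    threshold: float = 0.5,  # arbitrary value chosen for illustration
    protected: Sequence[str] = PROTECTED_DOMAINS,
) -> str:
    subdomain, main = split_host(url)
    # Assumption: an exact match with a protected domain is trusted.
    if main in protected:
        return "legitimate"
    # Stage 1: similarity of the main-level domain to protected domains.
    best = max((similarity(main, p) for p in protected), default=0.0)
    if best >= threshold:
        # Assumption: a near-copy of a protected domain is labelled phishing.
        return "phishing"
    # Stage 2: similarity is below the threshold, so inspect the subdomain.
    return "phishing" if subdomain_classifier(subdomain) else "legitimate"
```

With these stand-ins, `detect("http://paypa1.com/login")` is decided in the first stage, while `detect("http://paypal.com.secure-login.example.net/")` falls through to the subdomain check, because `example.net` shares no bigrams with either protected domain.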

WANG Yuqi, LIU Bowen, LIN Guoyuan. Phishing Detection Algorithm Based on Language Features of URL[J]. Computer Engineering and Applications, 2019, 55(24): 84-90.
