Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (20): 75-82. DOI: 10.3778/j.issn.1002-8331.1704-0480

• Network, Communication and Security •

Phishing detection method based on URL obfuscation technology recognition

DING Yan, Nurbol

  1. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
  • Online: 2017-10-15 Published: 2017-10-31

Abstract: To counter the obfuscation techniques commonly used in phishing URLs, a phishing web page detection method based on rule matching and logistic regression (RMLR) is proposed. First, a given web page is classified with a rule base built around obfuscation techniques such as violating URL naming standards and hiding phishing target words. If the page can be judged to be phishing at this stage, the subsequent feature extraction and detection steps are skipped to meet the needs of real-time detection. If it cannot be directly judged to be phishing, features are extracted from the URL and a logistic regression classifier performs a second detection pass, which improves adaptability and accuracy and reduces the false positives caused by an undersized rule base. RMLR also introduces a Jaccard random domain name recognition method based on string similarity to assist in detecting phishing URLs. Experimental results show that RMLR achieves an accuracy of 98.7%, indicating good detection performance.

Key words: phishing web pages, Uniform Resource Locator (URL) obfuscation techniques, rule matching, machine learning
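The two-stage flow described in the abstract (rule matching first, a classifier only when no rule fires) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three rules in `rule_hit`, the character-bigram form of the Jaccard similarity, the `looks_random` helper with its threshold, and the pluggable `classifier` callable (standing in for a trained logistic regression model) are all assumptions, since the abstract does not give the actual rule base, features, or the strings being compared.

```python
import re
from typing import Callable, Iterable
from urllib.parse import urlsplit

IPV4 = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def rule_hit(url: str, targets: Iterable[str]) -> bool:
    """Stage 1 (hypothetical rules): True if an obfuscation rule fires."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if "@" in parts.netloc:      # user-info trick: http://brand.com@evil.example/
        return True
    if IPV4.fullmatch(host):     # raw IP address used in place of a domain name
        return True
    labels = host.split(".")
    registered = ".".join(labels[-2:])  # naive: ignores suffixes such as co.uk
    rest = (".".join(labels[:-2]) + parts.path).lower()
    # Target word hidden in a subdomain or path but absent from the registered domain.
    return any(t in rest and t not in registered for t in targets)


def bigrams(s: str) -> set:
    """Set of overlapping two-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over character bigrams."""
    x, y = bigrams(a.lower()), bigrams(b.lower())
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def looks_random(label: str, known_words: Iterable[str],
                 threshold: float = 0.2) -> bool:
    """Assumed heuristic: a label resembling no known word is treated as random."""
    best = max((jaccard(label, w) for w in known_words), default=0.0)
    return best < threshold


def detect(url: str, targets: Iterable[str],
           classifier: Callable[[str], float], threshold: float = 0.5) -> bool:
    """Stage 1 rule match; fall back to the classifier score only if no rule fires."""
    if rule_hit(url, targets):
        return True              # early exit: no feature extraction needed
    return classifier(url) >= threshold
```

For example, `detect("http://paypal.com@evil.example/login", ["paypal"], model)` returns on the rule match without ever calling `model`, which mirrors the real-time shortcut the abstract describes. The `looks_random` score could be passed to the second stage as one of the URL features, although the abstract does not say how the authors combine it.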