Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (4): 63-68. DOI: 10.3778/j.issn.1002-8331.1811-0097

• Network, Communication and Security •

Technology of Mobile Application Identification Based on Density-Based Clustering and Random Forest

ZHU Di, CHEN Danwei

  1. School of Computer Science, School of Software, and School of Cyberspace Security, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  • Online: 2020-02-15 Published: 2020-03-06

Abstract:

With the rapid growth of mobile devices and the increasing variety of mobile applications, identifying the type of a mobile application has become an important technique in network management, marketing, and network attack prevention. In practice, almost all mobile applications encrypt their data with the SSL/TLS (Secure Sockets Layer/Transport Layer Security) protocol, which makes application identification more challenging. This paper proposes a novel technique for identifying the type of Android applications from their encrypted network traffic. The technique uses information entropy to analyze the purity of the clusters generated by the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm, and filters interfering samples out of the dataset using an entropy threshold set experimentally. A random forest model is then trained on the filtered dataset to identify the application type. Because the method relies only on the transmission patterns of the captured flows, it is effective for both encrypted and non-encrypted traffic. Experiments show that the method alleviates the misclassification of interfering samples, effectively improves dataset utilization, and achieves higher identification accuracy and recall.

Key words: encrypted traffic analysis, DBSCAN, random forest
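The entropy-based filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `cluster_entropy` and `filter_by_entropy`, the use of Shannon entropy in bits, the rule of keeping clusters whose entropy is at or below the threshold, and the convention that cluster id `-1` marks unclustered points (as in scikit-learn's `DBSCAN.labels_`) are all assumptions made for illustration. The cluster ids are taken as given input rather than computed here.

```python
import math
from collections import Counter, defaultdict


def cluster_entropy(app_labels):
    """Shannon entropy (in bits) of the application labels inside one cluster.

    0.0 means the cluster is pure (a single application); larger values
    mean the cluster mixes traffic from several applications.
    """
    total = len(app_labels)
    counts = Counter(app_labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def filter_by_entropy(cluster_ids, app_labels, threshold):
    """Return the sorted indices of samples that survive the purity filter.

    Assumptions (not taken from the paper):
      - cluster id -1 marks points left unclustered by DBSCAN; they are dropped;
      - a cluster is kept when its entropy is <= threshold.
    """
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    kept = []
    for cid, idxs in members.items():
        if cid == -1:
            continue  # assumed convention for unclustered points
        if cluster_entropy([app_labels[i] for i in idxs]) <= threshold:
            kept.extend(idxs)
    return sorted(kept)


# Toy example: cluster 0 is pure, cluster 1 mixes two apps, one point is unclustered.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, -1]
app_labels = ["a", "a", "a", "a", "b", "a", "b", "c"]
kept_indices = filter_by_entropy(cluster_ids, app_labels, threshold=0.5)
```

In a full pipeline along the lines of the abstract, the samples at `kept_indices` would then be used to train a random forest classifier. The threshold value here is arbitrary, whereas the paper sets it experimentally.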