Computer Engineering and Applications, 2024, Vol. 60, Issue (23): 275-285. DOI: 10.3778/j.issn.1002-8331.2307-0344

• Network, Communication and Security •

Semi-Supervised Network Traffic Classification Based on Feature Selection and Improved Tri-training

LI Daoquan, ZHU Shengkai, ZHAI Yuyang, HU Yifan   

  1. School of Information and Control Engineering, Qingdao University of Technology, Qingdao, Shandong 266520, China
  • Published: 2024-12-01 Online: 2024-11-29

Abstract: Network traffic classification is important for network management, but current machine learning-based traffic classification methods suffer from a labeling bottleneck and class imbalance. To address these two problems, a semi-supervised network traffic classification model is proposed that combines feature selection with an improved Tri-training algorithm. First, features that are highly correlated with the class but uncorrelated with each other are selected using the maximal information coefficient and the Pearson correlation coefficient, features that benefit the classification of minority classes are selected using an improved Relief F, and the selected features are combined into an optimal feature subset to alleviate the impact of imbalanced data on classification. Then the traditional Tri-training algorithm is improved by incorporating ensemble learning, optimized iteration and weighted decision making, and the improved Tri-training algorithm is used to address the labeling bottleneck. Finally, experiments are conducted on the Moore dataset. The results show that the proposed method achieves an F-measure of 95.26% using only a small amount of imbalanced labeled data, and that it outperforms advanced machine learning algorithms, the original Tri-training method and several of its improved variants.
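
The first selection step described in the abstract (keep features that are strongly related to the class but not to each other) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes a greedy filter, uses the absolute Pearson correlation as a stand-in for the maximal information coefficient when scoring relevance, assumes numeric binary class labels, and the thresholds `min_relevance` and `max_redundancy` as well as the feature names in the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence is constant (correlation undefined).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_features(columns, labels, min_relevance=0.5, max_redundancy=0.8):
    """Greedy relevance/redundancy filter (illustrative assumption only).

    columns: dict mapping feature name -> list of numeric values.
    labels:  list of numeric class labels, one per sample.
    Relevance is |Pearson(feature, label)|, used here in place of MIC.
    Redundancy is |Pearson(feature, already selected feature)|.
    """
    relevance = {name: abs(pearson(col, labels)) for name, col in columns.items()}
    # Consider the most class-relevant features first.
    candidates = sorted(
        (name for name in columns if relevance[name] >= min_relevance),
        key=lambda name: -relevance[name],
    )
    selected = []
    for name in candidates:
        # Keep a feature only if it is not too correlated with any kept one.
        if all(
            abs(pearson(columns[name], columns[kept])) <= max_redundancy
            for kept in selected
        ):
            selected.append(name)
    return selected


# Toy data with hypothetical feature names (not taken from the Moore dataset).
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "pkt_size": [1, 2, 1, 8, 9, 8],         # strongly related to the class
    "pkt_size_x2": [2, 4, 2, 16, 18, 16],   # exact multiple of pkt_size: redundant
    "duration": [1, 3, 2, 3, 2, 4],         # moderately related, not redundant
    "noise": [5, 1, 4, 2, 5, 3],            # unrelated to the class
}
selected = select_features(columns, labels)
print(selected)
```

On this toy data the filter keeps `pkt_size` and `duration`, drops `pkt_size_x2` as redundant and drops `noise` as irrelevant. The Relief F step for minority classes and the Tri-training stage are not covered by this sketch.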

Keywords: semi-supervised network, class imbalance, network traffic classification, feature selection, Tri-training