Semi-Supervised Network Traffic Classification Based on Feature Selection and Improved Tri-training

doi:10.3778/j.issn.1002-8331.2307-0344

Abstract

Abstract: Network traffic classification is important for network management, but current machine learning-based methods suffer from a labeling bottleneck and class imbalance. To address both problems, a semi-supervised network traffic classification model is proposed that combines feature selection with an improved Tri-training algorithm. First, features that are highly correlated with the classes but not with each other are selected using the maximal information coefficient and the Pearson correlation coefficient; features that help classify minority classes are selected using an improved ReliefF; and the selected features are combined into an optimal feature subset to alleviate the impact of imbalanced data on classification. Then the traditional Tri-training algorithm is improved through ensemble learning, optimized iteration, and weighted decision-making, and the improved algorithm is used to address the labeling bottleneck. Finally, experiments are conducted on the Moore dataset. The results show that the proposed method achieves an F-measure of 95.26% using a small, imbalanced set of labeled data, outperforming advanced machine learning algorithms, the original Tri-training method, and several of its improved variants.

Key words: semi-supervised network, class imbalance, network traffic classification, feature selection, Tri-training
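The first feature-selection step described in the abstract (keep features that are strongly related to the class but weakly related to each other) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract uses the maximal information coefficient for relevance, whereas the sketch substitutes the absolute Pearson correlation with the label so that it runs with the standard library only. The function names, the thresholds `relevance_min` and `redundancy_max`, and the toy flow features are all hypothetical. The improved ReliefF step and the Tri-training stage are not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def select_features(columns, labels, relevance_min=0.3, redundancy_max=0.9):
    """Greedy relevance/redundancy filter (hypothetical thresholds).

    columns: dict mapping feature name -> list of numeric values.
    labels:  list of numeric class labels of the same length.
    Relevance is |Pearson r| with the label, used here as a stand-in for MIC.
    """
    relevance = {name: abs(pearson(vals, labels)) for name, vals in columns.items()}
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    kept = []
    for name in ranked:
        if relevance[name] < relevance_min:
            break  # ranked is descending, so every later feature is also too weak
        # Keep the feature only if it is not redundant with any feature already kept.
        if all(abs(pearson(columns[name], columns[k])) <= redundancy_max for k in kept):
            kept.append(name)
    return kept


# Toy data: six flows, two classes (all values are made up for illustration).
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "pkt_size": [1, 2, 1, 5, 6, 5],
    "pkt_size_x2": [2, 4, 2, 10, 12, 10],  # exact multiple of pkt_size, hence redundant
    "iat": [1, 1, 5, 5, 9, 9],
    "noise": [1, 2, 3, 3, 2, 1],           # uncorrelated with the label
}
kept = select_features(columns, labels)
print(kept)
```

On this toy data the filter keeps one of the two packet-size features together with `iat` and discards `noise`. Ranking by relevance first and then checking redundancy only against features already kept means the more class-relevant member of a correlated pair survives.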

LI Daoquan, ZHU Shengkai, ZHAI Yuyang, HU Yifan. Semi-Supervised Network Traffic Classification Based on Feature Selection and Improved Tri-training[J]. Computer Engineering and Applications, 2024, 60(23): 275-285.

李道全, 祝圣凯, 翟豫阳, 胡一帆. 基于特征选择与改进的Tri-training的半监督网络流量分类[J]. 计算机工程与应用, 2024, 60(23): 275-285.
