Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (16): 199-203. DOI: 10.3778/j.issn.1002-8331.1907-0133


High-Frequency Similar Sequence Extraction Algorithm of Protocol Data Based on Simhash

HUANG Xuebo, XU Zhengguo, YAN Jikun   

  1. State Key Laboratory of Blind Signals Processing, Chengdu 610041, China
  • Online: 2020-08-15  Published: 2020-08-11





In network protocol feature extraction, existing algorithms based on frequency statistics and sequence alignment fall short in time efficiency and accuracy, so a high-frequency similar sequence extraction algorithm based on Simhash is proposed. Because the traditional Simhash algorithm is generally used in text processing, the protocol data are first segmented into words according to the characteristics of binary sequences. The length of the hash result and the number of comparisons are then reduced to further improve the efficiency of the algorithm. With these adaptations, Simhash becomes suitable for extracting high-frequency similar sequences. Experimental results show that the algorithm achieves an average coverage rate of 74.28%, with higher time efficiency at this level of accuracy.
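The general idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the overlapping byte n-gram segmentation, the 16-bit fingerprint width, the SHA-256 token hash, and the function names (`byte_ngrams`, `simhash`, `hamming`, `frequent_fingerprints`) are all assumptions made for illustration, since the abstract does not state the actual parameters.

```python
import hashlib
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> list:
    """Split a binary message into overlapping n-byte tokens.

    Assumed segmentation scheme; the paper's rule is not given in the abstract.
    """
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def simhash(data: bytes, n: int = 2, bits: int = 16) -> int:
    """Compute a `bits`-wide Simhash fingerprint over byte n-grams.

    A short fingerprint (16 bits here, an assumed value) stands in for the
    "reduced hash length" mentioned in the abstract.
    """
    mask = (1 << bits) - 1
    weights = [0] * bits
    for token, count in Counter(byte_ngrams(data, n)).items():
        # Hash each token, keep the low `bits` bits, and vote per bit position.
        h = int.from_bytes(hashlib.sha256(token).digest(), "big") & mask
        for b in range(bits):
            weights[b] += count if (h >> b) & 1 else -count
    # A bit is set in the fingerprint when its weighted vote is positive.
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def frequent_fingerprints(messages, top: int = 3, n: int = 2, bits: int = 16):
    """Count identical fingerprints so that only the most frequent ones need
    pairwise comparison (one assumed way to reduce the number of comparisons)."""
    return Counter(simhash(m, n, bits) for m in messages).most_common(top)
```

Sequences whose fingerprints differ in only a few bits under `hamming` would be treated as similar. The distance threshold is left to the caller because the abstract does not specify one.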

Key words: protocol analysis, binary sequence, Simhash, high-frequency similar sequence


