High-Frequency Similar Sequence Extraction Algorithm of Protocol Data Based on Simhash

doi:10.3778/j.issn.1002-8331.1907-0133

Abstract:

In network protocol feature extraction, existing algorithms based on frequency statistics and sequence alignment fall short in both time efficiency and accuracy. To address this, a Simhash-based algorithm for extracting high-frequency similar sequences is proposed. Because traditional Simhash is generally used for text processing, the protocol data are first segmented into "words" according to the characteristics of binary sequences. The length of the hash result and the number of comparisons are then reduced to further improve efficiency, which makes Simhash suitable for high-frequency similar sequence extraction. Experimental results show that the algorithm achieves an average coverage rate of 74.28% and relatively high time efficiency at that level of accuracy.

Key words: protocol analysis, binary sequence, Simhash, high-frequency similar sequence
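
The general idea of applying Simhash to binary protocol data can be illustrated with a minimal sketch. It is not the authors' implementation: the overlapping n-byte segmentation, the use of MD5 as the per-word hash, the 16-bit fingerprint length, and the function names `ngrams`, `simhash`, and `hamming` are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter


def ngrams(data: bytes, n: int = 2) -> list:
    """Split a byte sequence into overlapping n-byte 'words' (assumed segmentation)."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def simhash(data: bytes, n: int = 2, bits: int = 16) -> int:
    """Compute a short Simhash fingerprint over the n-byte words of `data`.

    A small `bits` value mirrors the idea of shortening the hash result;
    16 is an arbitrary choice here, not a value taken from the paper.
    """
    mask = (1 << bits) - 1
    weights = [0] * bits
    for word, count in Counter(ngrams(data, n)).items():
        # Hash each word, keep only the low `bits` bits.
        h = int.from_bytes(hashlib.md5(word).digest(), "big") & mask
        for b in range(bits):
            # A set bit votes +count, a cleared bit votes -count.
            weights[b] += count if (h >> b) & 1 else -count
    # Fingerprint bit b is 1 when the votes for that position are positive.
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Sequences whose fingerprints lie within a small Hamming distance of each other would be treated as similar. The distance threshold, and any grouping of fingerprints to cut down the number of pairwise comparisons, are left open here because the abstract does not specify them.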

HUANG Xuebo, XU Zhengguo, YAN Jikun. High-Frequency Similar Sequence Extraction Algorithm of Protocol Data Based on Simhash[J]. Computer Engineering and Applications, 2020, 56(16): 199-203.
