计算机工程与应用 ›› 2014, Vol. 50 ›› Issue (1): 105-110.

• 数据库、数据挖掘、机器学习 • 上一篇    下一篇

一种面向大规模短特征集的字符串匹配技术

李志文,张  伟   

  1. 北京信息科技大学 计算机学院,北京 100085
  • 出版日期:2014-01-01 发布日期:2013-12-30

String matching technology for large-scale short pattern set

LI Zhiwen, ZHANG Wei   

  1. School of Computer, Beijing Information Science and Technology University, Beijing 100085, China
  • Online:2014-01-01 Published:2013-12-30

摘要: 面向大规模特征集的字符串匹配技术在病毒检测、内容过滤等问题上的应用愈加广泛,而短模式串一直是阻碍性能提升的重要瓶颈。针对短模式串进行分析讨论,基于跳跃算法优化,采用了动态块大小和动态Hash处理以及Hash函数设计场景化的策略,同时探讨了多核处理器与多线程设计之间的关系。实验数据证明改进的算法策略具有支撑百万级特征集字符串匹配的能力。

关键词: 大规模特征集, 字符串匹配, 短模式串, Hash函数, 多线程技术

Abstract: Large-scale pattern set for the string matching technology in virus detection, content filtering and other applications become widespread increasingly, while the short pattern has been a major bottleneck impeding performance improvements. Based on optimization of the jump algorithm, short pattern strings are analyzed and discussed, and strategies of dynamic block size, dynamic Hash processing and the scene of the Hash function design are applied, besides, the relationship between multi-core processors and multi-threading design is explored. Experimental data prove that the improved algorithm is available to support one million size pattern set for string matching.

Key words: large-scale pattern set;string matching, short pattern, Hash function, multi-threading technology