Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (11): 246-251. DOI: 10.3778/j.issn.1002-8331.1902-0246

• Engineering and Applications •

Application of Simhash Algorithm in Text Deduplication

ZHANG Hang, SHENG Zhiwei, ZHANG Shibin, YANG Min   

  1. School of Cybersecurity, Chengdu University of Information Technology, Chengdu 610225, China
  • Online: 2020-06-01 Published: 2020-06-01

Abstract:

To improve the deduplication performance and accuracy of the Simhash algorithm, and to address its inability to reflect text distribution information, this paper proposes an improved Simhash algorithm based on information entropy weighting, abbreviated as E-Simhash. The algorithm introduces TF-IDF and information entropy to optimize the weight and threshold calculation in Simhash and to incorporate text distribution information, so that the generated fingerprint better reflects the proportion of key information. The correlation between fingerprint information and weights is also analyzed. Simulation experiments show that optimizing the weight calculation effectively improves the performance of Simhash: E-Simhash outperforms the traditional Simhash algorithm in deduplication rate, recall rate, and F value, and also performs well in Chinese similarity detection. These results support the effectiveness and accuracy of the proposed method.

Key words: Simhash, information entropy, term frequency-inverse document frequency, weight optimization, text deduplication
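
The abstract describes weighting Simhash features with TF-IDF and information entropy but does not state the formulas. The following is a minimal Python sketch of the general idea only, not the authors' E-Simhash implementation. Everything named here is an assumption made for illustration: the function names, the use of MD5 as the token hash, the smoothed IDF, and in particular the way the entropy term `-p * log2(p)` is multiplied into the TF-IDF weight. Documents are assumed to be already tokenized (Chinese text would first need word segmentation).

```python
import hashlib
import math
from collections import Counter

BITS = 64  # fingerprint length; an assumed, commonly used value


def token_hash(token, bits=BITS):
    """Map a token to a fixed-width integer (MD5 is an illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def entropy_tfidf_weights(doc, corpus):
    """Hypothetical weighting: TF-IDF scaled by a per-term entropy factor.

    This combination is an assumption for illustration; the paper's actual
    weight and threshold formulas are not given in the abstract.
    """
    counts = Counter(doc)
    n_docs = len(corpus)
    weights = {}
    for token, count in counts.items():
        p = count / len(doc)                                  # term frequency
        doc_freq = sum(1 for d in corpus if token in d)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0   # smoothed, always > 0
        entropy_term = -p * math.log2(p)                      # >= 0 for 0 < p <= 1
        weights[token] = p * idf * (1.0 + entropy_term)
    return weights


def simhash(weights, bits=BITS):
    """Standard weighted Simhash: add or subtract each weight per hash bit."""
    acc = [0.0] * bits
    for token, w in weights.items():
        h = token_hash(token, bits)
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    corpus = [
        ["simhash", "text", "deduplication", "text", "fingerprint"],
        ["simhash", "text", "deduplication", "fingerprint", "weight"],
        ["weather", "forecast", "rain", "tomorrow"],
    ]
    prints = [simhash(entropy_tfidf_weights(d, corpus)) for d in corpus]
    print("distance(doc0, doc1):", hamming(prints[0], prints[1]))
    print("distance(doc0, doc2):", hamming(prints[0], prints[2]))
```

In a deduplication setting, two documents would be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. The threshold is dataset-dependent and is left unspecified in this sketch.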