Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (11): 246-251. DOI: 10.3778/j.issn.1002-8331.1902-0246


Application of Simhash Algorithm in Text Deduplication

ZHANG Hang, SHENG Zhiwei, ZHANG Shibin, YANG Min   

  1. School of Cybersecurity, Chengdu University of Information Technology, Chengdu 610225, China
  • Online: 2020-06-01 Published: 2020-06-01





To improve the deduplication accuracy of the Simhash algorithm and to address its inability to reflect text distribution information, this paper proposes E-Simhash, an improved Simhash algorithm based on information entropy weighting. First, TF-IDF and information entropy are introduced to optimize the weight and threshold calculations in Simhash and to add text distribution information, so that the generated fingerprint better reflects the proportion of key information. The correlation between fingerprint information and weight is then verified. Finally, experimental results show that optimizing the weights effectively improves the performance of Simhash: the modified algorithm outperforms the traditional Simhash algorithm in deduplication rate, recall rate, and F value, and also performs well in Chinese similarity detection. These results verify the effectiveness and accuracy of the proposed method.
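The general idea of a weighted Simhash fingerprint can be illustrated with a minimal Python sketch. This is not the E-Simhash formulation from the paper: the abstract does not give the exact weight or threshold formulas, so the combination `tf_idf * (1 + entropy)` below, the function names, the smoothed IDF, the MD5-based token hash, and the 64-bit fingerprint length are all assumptions made for illustration only.

```python
import hashlib
import math
from collections import Counter


def token_hash(token, bits=64):
    """Map a token to a `bits`-wide integer (MD5 is an illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def entropy_of(term, corpus):
    """Shannon entropy of a term's occurrence distribution over the corpus."""
    counts = [doc.count(term) for doc in corpus]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def weights_for(doc, corpus):
    """Assumed weighting: smoothed TF-IDF scaled by (1 + entropy).

    This combination is hypothetical, not the paper's formula.
    """
    tf = Counter(doc)
    n_docs = len(corpus)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0  # always >= 1
        weights[term] = (count / len(doc)) * idf * (1.0 + entropy_of(term, corpus))
    return weights


def simhash(weights, bits=64):
    """Standard Simhash: add +w or -w per bit position, then keep the sign."""
    acc = [0.0] * bits
    for term, w in weights.items():
        h = token_hash(term, bits)
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i in range(bits):
        if acc[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # Documents are pre-tokenized lists; Chinese text would need a word segmenter first.
    corpus = [
        ["simhash", "text", "deduplication", "fingerprint"],
        ["simhash", "text", "deduplication", "weight"],
        ["weather", "forecast", "rain", "tomorrow"],
    ]
    prints = [simhash(weights_for(doc, corpus)) for doc in corpus]
    print(hamming(prints[0], prints[1]), hamming(prints[0], prints[2]))
```

In a deduplication setting, two texts would typically be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; how that threshold is set in E-Simhash is not specified in the abstract.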

Key words: Simhash, information entropy, term frequency-inverse document frequency, weight optimization, text deduplication


