%0 Journal Article
%A ZHANG Hang
%A SHENG Zhiwei
%A ZHANG Shibin
%A YANG Min
%T Application of Simhash Algorithm in Text Deduplication
%D 2020
%R 10.3778/j.issn.1002-8331.1902-0246
%J Computer Engineering and Applications
%P 246-251
%V 56
%N 11
%X

To improve the deduplication accuracy of the Simhash algorithm and to address its inability to reflect the distribution information of a text, this paper proposes an improved Simhash algorithm based on information entropy weighting, abbreviated as E-Simhash. First, TF-IDF and information entropy are introduced to optimize the weight and threshold calculation in Simhash and to add text distribution information, so that the generated fingerprint better reflects the proportion of key information. The correlation between fingerprint information and weight is also demonstrated. Finally, experimental results show that optimizing the weights effectively improves the performance of Simhash. The modified algorithm outperforms the traditional Simhash algorithm in deduplication rate, recall rate, and F value, and also performs well in Chinese similarity detection, which verifies the effectiveness and accuracy of the proposed method.
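
For orientation, the general idea of a weighted Simhash fingerprint can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's E-Simhash: the abstract does not give the entropy-weighting or threshold formulas, so the sketch uses plain smoothed TF-IDF weights only. The function names (`token_hash`, `tfidf_weights`, `simhash`, `hamming`), the use of MD5 as the token hash, and the 64-bit fingerprint length are all illustrative choices.

```python
import hashlib
import math
from collections import Counter


def token_hash(token: str, bits: int = 64) -> int:
    """Map a token to a `bits`-wide integer (MD5 is an assumed choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def tfidf_weights(doc_tokens, corpus):
    """Smoothed TF-IDF weight per term; `corpus` is a list of token lists."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[term] = (count / len(doc_tokens)) * idf
    return weights


def simhash(weights, bits: int = 64) -> int:
    """Add +w or -w per bit position for each term, then keep the sign."""
    acc = [0.0] * bits
    for term, w in weights.items():
        h = token_hash(term, bits)
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two documents would then be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; an entropy-based term would enter through the weights passed to `simhash`.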

%U http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.1902-0246