Application of Simhash Algorithm in Text Deduplication

doi:10.3778/j.issn.1002-8331.1902-0246

Abstract

To improve the deduplication effectiveness and accuracy of the Simhash algorithm, and to address its inability to reflect distribution information, this paper proposes an improved Simhash algorithm based on information entropy weighting, abbreviated E-Simhash. The algorithm introduces TF-IDF and information entropy, optimizes the weight and threshold calculation in Simhash, and adds text distribution information, so that the generated fingerprint better reflects the proportion of key information. The correlation between fingerprint information and weight is also analyzed. Experimental results show that optimizing the weight calculation effectively improves the performance of Simhash: E-Simhash outperforms the traditional Simhash algorithm in deduplication rate, recall rate, and F value, and also performs well in Chinese similarity detection. These results support the effectiveness and accuracy of the proposed method.

Keywords: Simhash, information entropy, term frequency-inverse document frequency (TF-IDF), weight optimization, text deduplication
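The abstract describes fingerprints whose bits are driven by term weights. As a rough illustration of that idea, the following is a minimal sketch of classic weighted Simhash with TF-IDF weights, not the paper's E-Simhash: the abstract does not give the entropy-based weight or the threshold formula, so both are omitted here. The function names, the MD5-based token hash, the 64-bit fingerprint width, and the smoothed IDF variant are all assumptions made for the example.

```python
import hashlib
import math
from collections import Counter


def token_hash(token, bits=64):
    """Map a token to a `bits`-bit integer (MD5 is an assumed choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def tf_idf_weights(doc_tokens, corpus):
    """TF-IDF weight per term of one document.

    `corpus` is a list of token lists. The IDF is a common smoothed
    variant, log((1 + N) / (1 + df)) + 1, chosen here for illustration.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[term] = (count / len(doc_tokens)) * idf
    return weights


def simhash(weights, bits=64):
    """Classic Simhash: each term adds +w or -w to every bit position."""
    acc = [0.0] * bits
    for term, w in weights.items():
        h = token_hash(term, bits)
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # Hypothetical pre-tokenized corpus; real Chinese text would need a
    # word segmenter first.
    corpus = [
        ["simhash", "text", "deduplication", "fingerprint"],
        ["simhash", "text", "deduplication", "weight"],
        ["face", "recognition", "local", "pattern"],
    ]
    prints = [simhash(tf_idf_weights(doc, corpus)) for doc in corpus]
    print(hamming(prints[0], prints[1]), hamming(prints[0], prints[2]))
```

Two documents are typically treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. Because the sign of each accumulated bit follows the heaviest terms, changing the weighting scheme changes which terms dominate the fingerprint, which is the lever the abstract says E-Simhash adjusts.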

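Abstract (Chinese version):

To improve the text deduplication performance and accuracy of the Simhash algorithm, and to address its inability to reflect distribution information, a Simhash algorithm based on information entropy weighting (E-Simhash for short) is proposed. The algorithm introduces TF-IDF and information entropy, optimizes the weight and threshold calculation in Simhash, and adds text distribution information, so that the final fingerprint better reflects the proportion of key information; the correlation between fingerprint information and weights is also analyzed. Simulation experiments show that optimizing the weight calculation effectively improves the performance of Simhash, that E-Simhash outperforms the traditional Simhash algorithm in deduplication rate, recall rate, and F value, and that it achieves good results in text deduplication.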

ZHANG Hang, SHENG Zhiwei, ZHANG Shibin, YANG Min. Application of Simhash Algorithm in Text Deduplication[J]. Computer Engineering and Applications, 2020, 56(11): 246-251.
