计算机工程与应用 ›› 2014, Vol. 50 ›› Issue (16): 192-197.

• Signal Processing •

Research on a Method to Detect Duplicate Chinese Short Texts

GAO Xiang1, LI Bing2

  1. Peking University HSBC Business School, Shenzhen, Guangdong 518055, China
    2. School of Information Technology & Management, University of International Business and Economics, Beijing 100029, China
  • Online: 2014-08-15 Published: 2014-08-14

Research on a method to detect duplicate Chinese short texts

GAO Xiang1, LI Bing2   

  1. Peking University HSBC Business School, Shenzhen, Guangdong 518055, China
    2. School of Information Technology & Management, University of International Business and Economics, Beijing 100029, China
  • Online:2014-08-15 Published:2014-08-14

Abstract: To address the redundancy problem in Chinese short texts, this paper proposes an effective de-duplication algorithm framework. Considering the huge volume and brevity of short texts, as well as the differences between Chinese and English, the framework introduces the Bloom Filter, the Trie tree, and the SimHash algorithm. In the first stage, a Bloom Filter or a Trie tree removes exact duplicates; in the second stage, the SimHash algorithm removes near duplicates. The parameters of the framework are designed, and simulation experiments confirm its feasibility and rationality.

Keywords: text de-duplication, Chinese short texts, Bloom Filter, Trie tree, SimHash algorithm

Abstract: This article presents an effective algorithm framework for text de-duplication, focusing on the redundancy problem of Chinese short texts. In view of the brevity and huge volume of short texts, and the differences between Chinese and English, the Bloom Filter, the Trie tree, and the SimHash algorithm are introduced. In the first stage of the framework, a Bloom Filter or a Trie tree removes exact duplicates; in the second stage, the SimHash algorithm detects near duplicates. The parameters used in the framework are designed, and simulation experiments verify its feasibility and rationality.
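As a rough illustration of the two-stage framework described in the abstract (not the authors' implementation), the Python sketch below pairs a minimal Bloom Filter for stage-one exact de-duplication with a character-bigram SimHash for stage-two near-duplicate detection. Character bigrams are used because Chinese text is not space-delimited; the filter size, number of hash functions, and Hamming-distance threshold are all illustrative assumptions, not parameters from the paper.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for stage one: exact de-duplication.
    Sizes and hash count are illustrative, not the paper's parameters."""
    def __init__(self, size=1 << 20, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8)

    def _positions(self, text):
        # Derive k bit positions from salted MD5 digests of the text.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{text}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def seen(self, text):
        """Return True if text was possibly seen before; record it either way.
        Bloom filters have no false negatives, so a False result is definite."""
        hit = True
        for pos in self._positions(text):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                hit = False
                self.bits[byte] |= 1 << bit
        return hit

def simhash(text, bits=64):
    """64-bit SimHash over character bigrams (suits unsegmented Chinese)."""
    vector = [0] * bits
    grams = [text[i:i + 2] for i in range(len(text) - 1)] or [text]
    for gram in grams:
        h = int.from_bytes(hashlib.md5(gram.encode("utf-8")).digest()[:8], "big")
        for b in range(bits):
            vector[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if vector[b] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def deduplicate(texts, threshold=3):
    """Two-stage framework: Bloom filter exact pass, then SimHash similarity pass.
    The linear fingerprint scan is O(n^2) overall; a real system would bucket
    fingerprints for faster lookup."""
    bloom, kept, fingerprints = BloomFilter(), [], []
    for text in texts:
        if bloom.seen(text):          # stage 1: exact duplicate (or rare false positive)
            continue
        fp = simhash(text)
        if any(hamming(fp, f) <= threshold for f in fingerprints):
            continue                  # stage 2: near duplicate
        fingerprints.append(fp)
        kept.append(text)
    return kept
```

A Trie tree could replace the Bloom Filter in stage one to trade memory for exactness: the Bloom Filter admits rare false positives, whereas a Trie gives exact membership at a higher memory cost for large corpora.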

Key words: text de-duplication, Chinese short texts, Bloom Filter, Trie tree, SimHash algorithm