计算机工程与应用 ›› 2014, Vol. 50 ›› Issue (16): 192-197.

• Signal Processing •

Research on a Method to Detect Duplicate Chinese Short Texts

GAO Xiang1, LI Bing2

  1. Peking University HSBC Business School, Shenzhen, Guangdong 518055, China
    2. School of Information Technology & Management, University of International Business and Economics, Beijing 100029, China
  • Online: 2014-08-15 Published: 2014-08-14

Research on a method to detect duplicate Chinese short texts

GAO Xiang1, LI Bing2   

  1. Peking University HSBC Business School, Shenzhen, Guangdong 518055, China
    2. School of Information Technology & Management, University of International Business and Economics, Beijing 100029, China
  • Online:2014-08-15 Published:2014-08-14

Abstract: To address the redundancy problem in Chinese short texts, this paper proposes an effective de-duplication algorithm framework. Considering the huge volume and brevity of short texts, as well as the differences between Chinese and English, the framework introduces the Bloom Filter, the Trie tree, and the SimHash algorithm. In the first stage, a Bloom Filter or a Trie tree removes exact duplicates; in the second stage, the SimHash algorithm removes near duplicates. The parameters of the framework are designed, and simulation experiments confirm its feasibility and rationality.

Keywords: text de-duplication, Chinese short texts, Bloom Filter, Trie tree, SimHash algorithm

Abstract: This article presents an effective algorithm framework for text de-duplication, focusing on the redundancy problem of Chinese short texts. In view of the brevity and huge volume of short texts, and the differences between Chinese and English, the Bloom Filter, the Trie tree, and the SimHash algorithm are introduced. In the first stage of the framework, a Bloom Filter or a Trie tree removes exact duplicates; in the second stage, the SimHash algorithm detects near duplicates. The parameters used in the framework are designed, and simulation experiments verify its feasibility and rationality.
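As a rough illustration of the two-stage framework described in the abstract (not the authors' implementation), the Python sketch below pairs a minimal Bloom Filter for stage-one exact de-duplication with a character-bigram SimHash for stage-two near-duplicate detection. Character bigrams are used because Chinese text is not space-delimited; the filter size, number of hash functions, and Hamming-distance threshold are all illustrative assumptions, not parameters from the paper.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for stage one: exact de-duplication.
    Sizes and hash count are illustrative, not the paper's parameters."""
    def __init__(self, size=1 << 20, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8)

    def _positions(self, text):
        # Derive k bit positions from salted MD5 digests of the text.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{text}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def seen(self, text):
        """Return True if text was possibly seen before; record it either way.
        Bloom filters have no false negatives, so a False result is definite."""
        hit = True
        for pos in self._positions(text):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                hit = False
                self.bits[byte] |= 1 << bit
        return hit

def simhash(text, bits=64):
    """64-bit SimHash over character bigrams (suits unsegmented Chinese)."""
    vector = [0] * bits
    grams = [text[i:i + 2] for i in range(len(text) - 1)] or [text]
    for gram in grams:
        h = int.from_bytes(hashlib.md5(gram.encode("utf-8")).digest()[:8], "big")
        for b in range(bits):
            vector[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if vector[b] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def deduplicate(texts, threshold=3):
    """Two-stage framework: Bloom filter exact pass, then SimHash similarity pass.
    The linear fingerprint scan is O(n^2) overall; a real system would bucket
    fingerprints for faster lookup."""
    bloom, kept, fingerprints = BloomFilter(), [], []
    for text in texts:
        if bloom.seen(text):          # stage 1: exact duplicate (or rare false positive)
            continue
        fp = simhash(text)
        if any(hamming(fp, f) <= threshold for f in fingerprints):
            continue                  # stage 2: near duplicate
        fingerprints.append(fp)
        kept.append(text)
    return kept
```

A Trie tree could replace the Bloom Filter in stage one to trade memory for exactness: the Bloom Filter admits rare false positives, whereas a Trie gives exact membership at a higher memory cost for large corpora.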

Key words: text de-duplication, Chinese short texts, Bloom Filter, Trie tree, SimHash algorithm