Computer Engineering and Applications ›› 2011, Vol. 47 ›› Issue (11): 15-18.

• Doctoral Forum •

Study on large scale duplicated text deletion algorithm based on language cadence

CHEN Fan1,2, FENG Zhiyong1, LI Xiaohong1, ZHAO Geng3

  1. School of Computer Science and Technology, Tianjin University, Tianjin 300072, China
  2. Dept. of Info. Science & Technology, College of Science, Tianjin University of Finance and Economics, Tianjin 300200, China
  3. Hebei University of Technology, Tianjin 300130, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2011-04-11 Published: 2011-04-11

Abstract: A study of large-scale Web documents shows that the natural paragraphs of a document have a distinctive language cadence, and that this cadence can identify a text uniquely. Punctuation marks capture the basic cadence of a text. Based on this observation, a duplicate detection method for large-scale document collections is proposed: a language cadence code is built for every natural paragraph of a document, and duplicates are found by comparing these codes, which gives detection at paragraph granularity. The method is reported to be more precise and more efficient than algorithms based on semantics or text structure. Experiments show good recall and precision in duplicate paragraph detection. The method detects documents whose content is fully duplicated, documents that share only some paragraphs, and documents reassembled from the same paragraphs in a different order, and it uses few system resources.

Key words: duplicate text detection, language cadence, punctuation
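
The abstract does not state how a language cadence code is constructed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that a paragraph's cadence can be approximated by the lengths of the character runs between punctuation marks, that paragraphs are separated by newlines, and that two paragraphs count as duplicates when their codes are equal. The names `PUNCTUATION`, `cadence_code`, `paragraph_codes`, and `duplicate_ratio` are hypothetical.

```python
import re

# Assumption: punctuation marks (ASCII and Chinese) that delimit cadence units.
PUNCTUATION = ",.;:!?,。;:!?、"
_SPLIT_RE = re.compile("[" + re.escape(PUNCTUATION) + "]")


def cadence_code(paragraph: str) -> tuple:
    """Hypothetical cadence code: lengths of the text runs between punctuation marks."""
    text = re.sub(r"\s+", "", paragraph)  # ignore whitespace differences
    runs = _SPLIT_RE.split(text)
    return tuple(len(run) for run in runs if run)


def paragraph_codes(document: str) -> set:
    """Set of cadence codes, one per non-empty line (assumed to be a paragraph)."""
    return {cadence_code(p) for p in document.split("\n") if p.strip()}


def duplicate_ratio(doc_a: str, doc_b: str) -> float:
    """Fraction of distinct paragraph codes of doc_a that also occur in doc_b."""
    codes_a = paragraph_codes(doc_a)
    codes_b = paragraph_codes(doc_b)
    if not codes_a:
        return 0.0
    return len(codes_a & codes_b) / len(codes_a)


if __name__ == "__main__":
    original = "Hello, world. This is a test!\nSecond paragraph; with marks, here."
    shuffled = "Second paragraph; with marks, here.\nHello, world. This is a test!"
    partial = "Hello, world. This is a test!\nCompletely different text, no match."
    print(duplicate_ratio(original, shuffled))  # paragraph order does not matter
    print(duplicate_ratio(original, partial))   # only one paragraph is shared
```

Because the comparison works on a set of per-paragraph codes, reordering paragraphs does not change the result, which matches the paragraph-granularity behaviour the abstract describes. A code built only from run lengths can collide for short paragraphs with few punctuation marks, so a practical version would likely need a richer code or a minimum paragraph length.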