Computer Engineering and Applications, 2011, Vol. 47, Issue 24: 199-201.

• Graphics, Images, and Pattern Recognition •

Study on model for plagiarism-detection of scientific papers based on sentence similarity

LENG Qiangkui1, QIN Yuping1, WANG Chunli2

  1. College of Information Science and Engineering, Bohai University, Jinzhou, Liaoning 121000, China
  2. College of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning 116026, China
  • Online: 2011-08-21  Published: 2011-08-21

Abstract: A plagiarism-detection model for scientific papers based on sentence similarity is proposed. A Local Word-Frequency Fingerprint (LWFF) algorithm first screens large-scale document collections quickly to find documents suspected of plagiarism. Sentence similarity between source and destination texts is then computed with the Longest Sorted Common Subsequence (LSCS) algorithm, and the plagiarized passages are marked to provide supporting evidence. Experiments on the standard Chinese dataset SOGOU-T show that the model has a strong capacity for mining local information and, to some extent, overcomes the low precision of existing plagiarism-detection algorithms for scientific papers.

Key words: sentence similarity, plagiarism detection, local word frequency, Longest Sorted Common Subsequence (LSCS)
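
To make the sentence-level step more concrete, the following is a minimal Python sketch and not the authors' implementation. The abstract does not define LSCS precisely, so the sketch assumes it behaves like a standard longest common subsequence over word tokens. The function names, the whitespace tokenization, and the normalization by the longer sentence are all illustrative assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)          # DP row for the previous token of `a`
    for x in a:
        curr = [0]                     # column 0: empty prefix of `b`
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sentence_similarity(s1, s2):
    """Assumed similarity score in [0, 1]: LCS length over the longer sentence.

    Whitespace tokenization is a placeholder; Chinese text would need a
    word segmenter before this step.
    """
    t1, t2 = s1.split(), s2.split()
    if not t1 or not t2:
        return 0.0
    return lcs_length(t1, t2) / max(len(t1), len(t2))


if __name__ == "__main__":
    src = "the model detects copied sentences"
    dst = "the model quickly detects copied sentences"
    print(round(sentence_similarity(src, dst), 3))
```

In such a sketch, a sentence pair whose score exceeds a chosen threshold would be flagged, and the matched tokens could serve as the marked evidence. The threshold itself is not given in the abstract and would have to be tuned.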