计算机工程与应用 (Computer Engineering and Applications), 2009, Vol. 45, Issue (12): 141-143. DOI: 10.3778/j.issn.1002-8331.2009.12.046

• Database, Signal and Information Processing •

Research on elimination of similar web pages

FAN Yong (樊勇)1, ZHENG Jia-heng (郑家恒)2

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
  2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Taiyuan 030006, China
  • Received: 2008-03-06  Revised: 2008-05-26  Online: 2009-04-21  Published: 2009-04-21
  • Contact: FAN Yong

Abstract: Duplicate web pages returned by search engines not only waste storage resources but also increase the browsing burden on users. Aiming at the characteristics of duplicated web pages, a semantics-based duplicate-elimination method is proposed. The method extracts a topic-sentence vector from the page body according to the position of each sentence in the text and the importance of its chunks, then computes the semantic similarity between topic-sentence vectors and removes the duplicate pages. Experiments show that the method detects both fully duplicated and partially duplicated web pages with good accuracy.
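
To make the pipeline described in the abstract concrete, the following Python sketch walks through the main steps: split the page body into sentences, score sentences by position plus a rough importance proxy to form a topic-sentence vector, and flag two pages as duplicates when their vectors are sufficiently similar. This is an illustrative sketch only; the function names, the length-based stand-in for chunk importance, the character-level cosine similarity (in place of the paper's semantic similarity computation), and the 0.8 threshold are assumptions, not details taken from the paper.

# Illustrative sketch of a topic-sentence-based duplicate-page detector.
# The scoring scheme and similarity measure are simplified stand-ins for
# the chunk-importance weights and semantic similarity used in the paper.

import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Split page body text into sentences (Chinese and English punctuation)."""
    return [s.strip() for s in re.split(r"[。！？.!?]", text) if s.strip()]


def topic_sentences(text: str, top_k: int = 5) -> list[str]:
    """Build a topic-sentence vector: score each sentence by its position in
    the text (earlier sentences weigh more) plus a crude length-based proxy
    for chunk importance, then keep the top_k sentences."""
    scored = []
    for i, s in enumerate(split_sentences(text)):
        position_score = 1.0 / (1 + i)            # earlier sentences weigh more
        importance_score = min(len(s), 60) / 60.0  # stand-in for chunk importance
        scored.append((position_score + importance_score, s))
    scored.sort(reverse=True)
    return [s for _, s in scored[:top_k]]


def similarity(a: list[str], b: list[str]) -> float:
    """Cosine similarity over character counts of two topic-sentence vectors,
    a placeholder for the paper's semantic similarity computation."""
    ca, cb = Counter("".join(a)), Counter("".join(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_duplicate(page_a: str, page_b: str, threshold: float = 0.8) -> bool:
    """Flag two pages as (near-)duplicates if their topic-sentence vectors
    are sufficiently similar; the threshold is an assumed tuning parameter."""
    return similarity(topic_sentences(page_a), topic_sentences(page_b)) >= threshold

Because only the topic-sentence vectors are compared, a page that copies part of another page's body can still score above the threshold, which mirrors the abstract's claim that partially duplicated pages are also detected.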