Computer Engineering and Applications, 2009, Vol. 45, Issue (12): 141-143. DOI: 10.3778/j.issn.1002-8331.2009.12.046

• Database, Signal and Information Processing •

Research on elimination of similar web pages

FAN Yong1, ZHENG Jia-heng2

  1. Department of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
  2. Key Laboratory of Ministry of Education for Computation Intelligence and Chinese Information Processing, Taiyuan 030006, China
  • Received: 2008-03-06 Revised: 2008-05-26 Online: 2009-04-21 Published: 2009-04-21
  • Contact: FAN Yong

网页去重方法研究

樊 勇1,郑家恒2   

  1. 山西大学 计算机与信息技术学院,太原 030006
  2. 计算智能与中文信息处理省部共建教育部重点实验室,太原 030006
  • 通讯作者: 樊 勇

Abstract: Similar web pages returned by a search engine not only waste storage resources but also increase the browsing burden on users. This paper proposes a semantics-based method for detecting similar web pages. The method extracts a topic-sentence vector from the body text of each page, using the position of each sentence in the text and the importance of its chunks. It then detects similar pages by computing the semantic similarity between their topic-sentence vectors. The experimental results show that the method detects both completely similar and partly similar web pages with fairly good accuracy.

摘要: 搜索引擎返回的重复网页不但浪费了存储资源,而且加重了用户浏览的负担。针对网页重复的特征,提出了一种基于语义的去重方法。该方法通过句子在文本中的位置和组块的重要度,提取出网页正文的主题句向量,然后对主题句向量进行语义相似度计算,把重复的网页去除。实验证明,该方法对全文重复和部分重复的网页都能进行较准确的检测。
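
The two-stage pipeline described in the abstract (select topic sentences, then compare them) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: all function names are hypothetical, token frequency stands in for chunk importance, Jaccard word overlap stands in for the semantic similarity measure, and the values of `k`, `position_weight`, and `threshold` are arbitrary. Chinese text would also need a proper word segmenter in place of the regex tokenizer.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive split on Western and Chinese sentence-ending punctuation.
    parts = re.split(r"[.!?。！？]+", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    # Lowercased word tokens; an assumption that only suits space-delimited text.
    return re.findall(r"\w+", sentence.lower())


def topic_sentences(text, k=2, position_weight=0.5):
    """Pick the k highest-scoring sentences, returned in document order."""
    sentences = split_sentences(text)
    freq = Counter(tok for s in sentences for tok in tokenize(s))
    total = sum(freq.values())
    scored = []
    for i, s in enumerate(sentences):
        toks = tokenize(s)
        if not toks:
            continue
        # Stand-in for chunk importance: mean relative frequency of the tokens.
        importance = sum(freq[t] for t in toks) / (len(toks) * total)
        # Position cue: the first and last sentences receive a bonus.
        position = 1.0 if i in (0, len(sentences) - 1) else 0.0
        score = position_weight * position + (1 - position_weight) * importance
        scored.append((score, i, s))
    top = sorted(scored, key=lambda item: -item[0])[:k]
    return [s for _, _, s in sorted(top, key=lambda item: item[1])]


def sentence_similarity(a, b):
    # Jaccard overlap of token sets, a placeholder for a semantic measure.
    ta, tb = set(tokenize(a)), set(tokenize(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def page_similarity(text_a, text_b, k=2):
    """Symmetric average of best-match similarities between topic sentences."""
    va, vb = topic_sentences(text_a, k), topic_sentences(text_b, k)
    if not va or not vb:
        return 0.0

    def directed(xs, ys):
        return sum(max(sentence_similarity(x, y) for y in ys) for x in xs) / len(xs)

    return (directed(va, vb) + directed(vb, va)) / 2


def is_similar(text_a, text_b, threshold=0.8, k=2):
    # The threshold is an arbitrary illustrative value, not one from the paper.
    return page_similarity(text_a, text_b, k) >= threshold
```

Comparing topic sentences rather than whole documents is what would let a scheme of this kind flag partly similar pages: two pages can share their key sentences while differing in surrounding text such as navigation or advertisements.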