Computer Engineering and Applications ›› 2007, Vol. 43 ›› Issue (6): 119-121.

• Network, Communication and Security •

Research on Duplicated News Webpages Deletion Method Based on The Issue Time

YongLian Luo, Yongkui Zhang

  1. Jinzhong University; Department of Computer Science, Shanxi University
  • Received: 2006-06-13 Online: 2007-02-21 Published: 2007-02-21
  • Contact: YongLian Luo

Abstract: Web search results often contain redundant pages with identical content. Such pages waste storage and complicate information retrieval and other text processing. After extracting the title, main content, and publication date of each news page, this paper exploits the time-sensitivity (fragility) of news: pages are partitioned into groups by publication date, and an exploratory study of duplicate removal is carried out on that basis. The approach substantially reduces computation time and improves the accuracy of duplicate removal.

Key words: news web pages, main content extraction, duplicate web page removal, weight calculation
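
The date-grouping idea in the abstract can be illustrated with a minimal sketch. It assumes the title, main content, and publication date have already been extracted, and that pages are compared only with other pages sharing the same publication date. The paper's own weight calculation is not described here, so Jaccard similarity over word shingles is swapped in as a stand-in. The function names, the dictionary keys `date`, `title`, and `content`, and the `threshold` value are all hypothetical.

```python
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (one shingle if shorter than k)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe_by_date(pages, threshold=0.8):
    """Drop near-duplicate pages, comparing only pages with the same date.

    `pages` is a list of dicts with the hypothetical keys "date" and "content".
    Jaccard similarity is an assumed stand-in for the paper's weighting scheme.
    """
    # Step 1: partition pages into groups keyed by publication date.
    groups = defaultdict(list)
    for page in pages:
        groups[page["date"]].append(page)

    # Step 2: within each group, keep a page only if it is not too similar
    # to any page already kept from that group.
    kept = []
    for date in sorted(groups):
        survivors = []  # list of (page, shingle set) pairs
        for page in groups[date]:
            sig = shingles(page["content"])
            if all(jaccard(sig, other) < threshold for _, other in survivors):
                survivors.append((page, sig))
        kept.extend(page for page, _ in survivors)
    return kept
```

Grouping first means the pairwise comparisons are limited to pages within each date group rather than spanning the whole collection, which is where the reduction in computation time would come from. One consequence of this design is that identical text published under different dates is never compared, so the sketch relies on the assumption that reprints of a news item share a publication date.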