Computer Engineering and Applications ›› 2007, Vol. 43 ›› Issue (28): 177-180.

• Database and Information Processing •

Study on a duplicate removal method for web pages based on news subject elements

WANG Peng, ZHANG Yong-kui, ZHANG Yan, LIU Rui

  1. School of Computer & Information Technology, Shanxi University, Taiyuan 030006, China
    Key Laboratory of Ministry of Education for Computation Intelligence and Chinese Information Processing, Taiyuan 030006, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2007-10-01 Published: 2007-10-01
  • Contact: WANG Peng

Abstract: Web search results often contain redundant pages with identical content. This paper proposes a duplicate removal algorithm for news web pages that learns the news content from the elements of the news subject. The basic idea is as follows. First, the time and place phrases describing when and where the event occurred are extracted from the news elements. Then, the news content is extracted with the help of these time and place phrases. Finally, the similarity of the learned news content is computed to judge how far two news web pages duplicate each other. Experimental results show that the method can remove duplicate news web pages on the basis of their news content and achieves high recall and precision.

Key words: elements of news subject, fuzzy matching, duplicate removal algorithm
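
The three steps named in the abstract (extract time and place phrases, use them to isolate the news content, then compare contents by similarity) can be illustrated with a minimal Python sketch. Nothing below is taken from the paper itself: the regular expression `TIME_RE`, the toy gazetteer `PLACES`, the helper names, the threshold of 0.8, and the use of `difflib.SequenceMatcher` as a stand-in for fuzzy matching are all assumptions made only for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical time pattern (ISO dates or Chinese month/day); the paper's
# actual extraction rules are not given in the abstract.
TIME_RE = re.compile(r"\d{4}-\d{2}-\d{2}|\d{1,2}月\d{1,2}日")
# Toy gazetteer standing in for a real place-name recognizer (assumption).
PLACES = ("Taiyuan", "Beijing", "Shanghai")


def extract_elements(text):
    """Step 1: return (time phrases, place phrases) found in the text."""
    times = TIME_RE.findall(text)
    places = [p for p in PLACES if p in text]
    return times, places


def extract_content(text):
    """Step 2: keep only sentences that mention a time or place phrase."""
    times, places = extract_elements(text)
    keys = times + places
    sentences = [s.strip() for s in re.split(r"[.!?。!?]", text) if s.strip()]
    return " ".join(s for s in sentences if any(k in s for k in keys))


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for fuzzy matching."""
    return SequenceMatcher(None, a, b).ratio()


def is_duplicate(page_a, page_b, threshold=0.8):
    """Step 3: two pages are duplicates if their extracted contents are similar."""
    content_a, content_b = extract_content(page_a), extract_content(page_b)
    if not content_a or not content_b:
        return False  # no time/place anchor found, so no judgement is made
    return similarity(content_a, content_b) >= threshold


# Two pages carrying the same story inside different page furniture,
# and a third page about a different event.
page1 = "Home | Login. On 2007-10-01 a fire broke out in Taiyuan. Share this page."
page2 = "Site map. On 2007-10-01 a fire broke out in Taiyuan. Copyright notice."
page3 = "On 2007-09-15 a concert was held in Beijing."

print(is_duplicate(page1, page2))
print(is_duplicate(page1, page3))
```

In this sketch the time and place phrases act as anchors that separate the news body from navigation text, so the comparison runs on the story rather than on the whole page. A real system would need proper named-entity extraction and a tuned threshold.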