Research on a Web Page Deduplication Method Based on the Topic Elements of News Web Pages

Computer Engineering and Applications, 2007, Vol. 43, Issue 28: 177-180.

WANG Peng, ZHANG Yong-kui, ZHANG Yan, LIU Rui

School of Computer & Information Technology, Shanxi University, Taiyuan 030006, China
Key Laboratory of Ministry of Education for Computation Intelligence and Chinese Information Processing, Taiyuan 030006, China

Received: 1900-01-01 Revised: 1900-01-01 Online: 2007-10-01 Published: 2007-10-01
Contact: WANG Peng

Abstract

Abstract: Web search results often contain redundant pages with identical content. This paper proposes a deduplication algorithm for news web pages that learns the news content from the topic elements of the news. The basic idea is as follows. First, the phrases describing the time and place of the event are extracted from the news elements. Then, the extracted time and place phrases are used to extract the news content. Finally, the similarity of the learned news content is computed to judge the degree of duplication between news pages. Experimental results show that the method can deduplicate news web pages on the basis of their news content and achieves high recall and precision.

Key words: elements of news subject, fuzzy matching, duplicate removal algorithm
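The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the extraction rules or the similarity measure, so the regular expressions `TIME_PATTERN` and `PLACE_PATTERN`, the helper names, the threshold of 0.8, and the use of `difflib.SequenceMatcher` as a stand-in for fuzzy matching are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical patterns: the abstract does not give the real extraction rules.
TIME_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")   # e.g. 2007-10-01
PLACE_PATTERN = re.compile(r"\bin ([A-Z][a-z]+)\b")   # e.g. "in Taiyuan"


def split_sentences(text):
    """Split text into sentences after terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_content(page_text):
    """Steps 1-2: keep only sentences that contain a time or place phrase."""
    kept = [
        s for s in split_sentences(page_text)
        if TIME_PATTERN.search(s) or PLACE_PATTERN.search(s)
    ]
    return " ".join(kept)


def similarity(page_a, page_b):
    """Step 3: fuzzy similarity in [0, 1] between the extracted contents."""
    content_a = extract_content(page_a)
    content_b = extract_content(page_b)
    if not content_a or not content_b:
        return 0.0  # nothing extracted, so no evidence of duplication
    return SequenceMatcher(None, content_a, content_b).ratio()


def is_duplicate(page_a, page_b, threshold=0.8):
    """Treat two pages as duplicates when similarity reaches the threshold."""
    return similarity(page_a, page_b) >= threshold


page_a = "Home | Login. A fire broke out in Taiyuan on 2007-10-01. Share this page."
page_b = "Top stories. A fire broke out in Taiyuan on 2007-10-01. Copyright notice."
page_c = "Markets rose in Beijing on 2007-09-30."

print(is_duplicate(page_a, page_b))  # True: same news sentence, different page chrome
print(is_duplicate(page_a, page_c))  # False: different event
```

Comparing only the sentences anchored by a time or place phrase means that navigation text and other page chrome do not affect the score, which is the point of extracting the content before measuring similarity.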

WANG Peng, ZHANG Yong-kui, ZHANG Yan, LIU Rui. Study on duplicated removal algorithm web pages based on elements of news subject[J]. Computer Engineering and Applications, 2007, 43(28): 177-180.
