Study on duplicated removal algorithm web pages based on elements of news subject

Computer Engineering and Applications, 2007, Vol. 43, Issue (28): 177-180.

Database and Information Processing

WANG Peng, ZHANG Yong-kui, ZHANG Yan, LIU Rui

School of Computer & Information Technology, Shanxi University, Taiyuan 030006, China
Key Laboratory of Ministry of Education for Computation Intelligence and Chinese Information Processing, Taiyuan 030006, China

Received: 1900-01-01 Revised: 1900-01-01 Online: 2007-10-01 Published: 2007-10-01
Contact: WANG Peng

Abstract

Abstract: Web search results often contain redundant pages whose content is identical. This paper proposes a duplicate-removal algorithm for news web pages that learns the news content from the elements of the news subject. The basic idea is as follows. First, the time and place phrases that describe when and where the event occurred are extracted from the news elements. Then, the extracted time and place phrases are used to extract the news content. Finally, the similarity of the learned news content is computed to judge the degree to which the news pages duplicate one another. Experimental results indicate that the method can remove duplicate news web pages on the basis of news content, and that it achieves high recall and precision.

Key words: elements of news subject, fuzzy matching, duplicate removal algorithm
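
The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no extraction rules or similarity measure, so the date pattern `TIME_PATTERN`, the toy place list `PLACES`, the sentence filter, the use of `difflib.SequenceMatcher` as the fuzzy match, and the `threshold` of 0.8 are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Assumption: time phrases look like ISO dates. The paper's real rules are not given.
TIME_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# Assumption: a toy gazetteer stands in for place-phrase extraction.
PLACES = ("Taiyuan", "Beijing", "Shanghai")


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation (crude)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_content(page_text):
    """Steps 1 and 2: keep sentences that mention a time or place phrase."""
    kept = []
    for sentence in split_sentences(page_text):
        has_time = TIME_PATTERN.search(sentence) is not None
        has_place = any(place in sentence for place in PLACES)
        if has_time or has_place:
            kept.append(sentence)
    return " ".join(kept)


def is_duplicate(page_a, page_b, threshold=0.8):
    """Step 3: fuzzy-compare the extracted content of two pages."""
    content_a = extract_content(page_a)
    content_b = extract_content(page_b)
    if not content_a or not content_b:
        # Nothing was extracted, so the sketch declines to call it a duplicate.
        return False
    ratio = SequenceMatcher(None, content_a, content_b).ratio()
    return ratio >= threshold


page_1 = "Home | Login | Share. A fire broke out in Taiyuan on 2007-10-01. Copyright notice."
page_2 = "Subscribe now! A fire broke out in Taiyuan on 2007-10-01. Related links here."
page_3 = "Floods hit Shanghai on 2007-09-15. Weather today."

print(is_duplicate(page_1, page_2))
print(is_duplicate(page_1, page_3))
```

Comparing only the extracted sentences, rather than the whole page, is what lets two pages with different navigation text and footers still be matched on the same news story.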

WANG Peng, ZHANG Yong-kui, ZHANG Yan, LIU Rui. Study on duplicated removal algorithm web pages based on elements of news subject[J]. Computer Engineering and Applications, 2007, 43(28): 177-180.

