计算机工程与应用 (Computer Engineering and Applications), 2009, Vol. 45, Issue (12): 141-143. DOI: 10.3778/j.issn.1002-8331.2009.12.046

• Database, Signal and Information Processing •

Research on elimination of similar web pages

FAN Yong (樊勇)1, ZHENG Jia-heng (郑家恒)2

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
  2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Taiyuan 030006, China
  • Received: 2008-03-06  Revised: 2008-05-26  Online: 2009-04-21  Published: 2009-04-21
  • Contact: FAN Yong

Abstract: Duplicate web pages returned by search engines not only waste storage resources but also increase the browsing burden on users. Aiming at the characteristics of duplicated web pages, a semantics-based duplicate-elimination method is proposed. The method extracts a topic-sentence vector from the page body according to the position of each sentence in the text and the importance of its chunks, then computes the semantic similarity between topic-sentence vectors and removes the duplicate pages. Experiments show that the method detects both fully duplicated and partially duplicated web pages with good accuracy.
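
To make the pipeline described in the abstract concrete, the following Python sketch walks through the main steps: split the page body into sentences, score sentences by position plus a rough importance proxy to form a topic-sentence vector, and flag two pages as duplicates when their vectors are sufficiently similar. This is an illustrative sketch only; the function names, the length-based stand-in for chunk importance, the character-level cosine similarity (in place of the paper's semantic similarity computation), and the 0.8 threshold are assumptions, not details taken from the paper.

# Illustrative sketch of a topic-sentence-based duplicate-page detector.
# The scoring scheme and similarity measure are simplified stand-ins for
# the chunk-importance weights and semantic similarity used in the paper.

import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Split page body text into sentences (Chinese and English punctuation)."""
    return [s.strip() for s in re.split(r"[。！？.!?]", text) if s.strip()]


def topic_sentences(text: str, top_k: int = 5) -> list[str]:
    """Build a topic-sentence vector: score each sentence by its position in
    the text (earlier sentences weigh more) plus a crude length-based proxy
    for chunk importance, then keep the top_k sentences."""
    scored = []
    for i, s in enumerate(split_sentences(text)):
        position_score = 1.0 / (1 + i)            # earlier sentences weigh more
        importance_score = min(len(s), 60) / 60.0  # stand-in for chunk importance
        scored.append((position_score + importance_score, s))
    scored.sort(reverse=True)
    return [s for _, s in scored[:top_k]]


def similarity(a: list[str], b: list[str]) -> float:
    """Cosine similarity over character counts of two topic-sentence vectors,
    a placeholder for the paper's semantic similarity computation."""
    ca, cb = Counter("".join(a)), Counter("".join(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_duplicate(page_a: str, page_b: str, threshold: float = 0.8) -> bool:
    """Flag two pages as (near-)duplicates if their topic-sentence vectors
    are sufficiently similar; the threshold is an assumed tuning parameter."""
    return similarity(topic_sentences(page_a), topic_sentences(page_b)) >= threshold

Because only the topic-sentence vectors are compared, a page that copies part of another page's body can still score above the threshold, which mirrors the abstract's claim that partially duplicated pages are also detected.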