Research on elimination of similar web pages

doi:10.3778/j.issn.1002-8331.2009.12.046

Computer Engineering and Applications, 2009, Vol. 45, Issue (12): 141-143.

• Database, Signal and Information Processing •

FAN Yong¹, ZHENG Jia-heng²

1. Department of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
2. Key Laboratory of Ministry of Education for Computation Intelligence and Chinese Information Processing, Taiyuan 030006, China

Received: 2008-03-06; Revised: 2008-05-26; Online: 2009-04-21; Published: 2009-04-21
Contact: FAN Yong

Abstract

Duplicate web pages returned by search engines not only waste storage resources but also increase the browsing burden on users. Based on the characteristics of web page duplication, this paper proposes a semantics-based method for detecting and removing similar pages. The method extracts a topic-sentence vector from the body text of each page, using the position of each sentence in the text and the importance of its chunks. It then computes the semantic similarity between topic-sentence vectors and removes the pages found to be duplicates. Experimental results show that the method detects both fully duplicated and partially duplicated pages with fairly high accuracy.
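The two stages named in the abstract (topic-sentence extraction, then vector similarity) can be illustrated with a minimal sketch. It is not the authors' implementation, and every name and number in it is an assumption made for illustration: `position_weight` simply favors the first and last sentences, within-page word frequency stands in for chunk importance, plain cosine similarity over word counts stands in for the semantic similarity measure, and the `threshold` and `k` values are arbitrary.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Split text after common English and Chinese sentence terminators."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Lowercased word tokens; a real system would use a segmenter or chunker."""
    return re.findall(r"\w+", sentence.lower())


def position_weight(index, total):
    """ASSUMPTION: opening and closing sentences are likelier topic sentences."""
    return 1.5 if index == 0 or index == total - 1 else 1.0


def topic_vector(text, k=2):
    """Bag-of-words vector built from the k highest-scoring sentences."""
    sentences = split_sentences(text)
    freq = Counter(tok for s in sentences for tok in tokenize(s))
    scored = []
    for i, sentence in enumerate(sentences):
        toks = tokenize(sentence)
        if not toks:
            continue
        # ASSUMPTION: mean word frequency is a stand-in for chunk importance.
        importance = sum(freq[t] for t in toks) / len(toks)
        scored.append((position_weight(i, len(sentences)) * importance, i, toks))
    # Highest score first; ties are broken by earlier position in the text.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:k]
    vector = Counter()
    for _, _, toks in top:
        vector.update(toks)
    return vector


def cosine(a, b):
    """Cosine similarity of two Counters (a lexical, not semantic, measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def is_similar(text_a, text_b, threshold=0.8, k=2):
    """ASSUMPTION: pages count as duplicates above an arbitrary threshold."""
    return cosine(topic_vector(text_a, k), topic_vector(text_b, k)) >= threshold
```

Comparing only the topic-sentence vectors rather than the full text is what would let such an approach tolerate partial duplication, since boilerplate appended by a mirror site is unlikely to rank among the top sentences. A semantic measure, as the abstract describes, would additionally match synonyms that this lexical stand-in misses.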

FAN Yong, ZHENG Jia-heng. Research on elimination of similar web pages[J]. Computer Engineering and Applications, 2009, 45(12): 141-143.

