Computer Engineering and Applications, 2014, Vol. 50, Issue (2): 116-119.

• Database, Data Mining, Machine Learning •

Research on search strategy of web spider in topic-oriented search engines

SHI Baoming1 (史宝明), HE Yuanxiang1 (贺元香), WU Chongzheng2 (吴崇正)

  1. School of Electronics and Information Engineering, Lanzhou University of Arts and Science, Lanzhou 730000, China
  2. School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China
  • Online: 2014-01-15 Published: 2014-01-26

Abstract: Traditional focused crawlers are inefficient. A focused crawler (web spider) visits the links it judges most valuable, so keeping the search focused on a given topic is the key problem. Traditional methods compute only the relevance of each individual link and ignore the relevance relationships among the unlabeled URLs still awaiting analysis, which lowers crawl efficiency. This paper proposes a relevance-judgment algorithm based on a link model that combines labeled seed URLs with unlabeled candidate URLs to judge the relevance of the unlabeled URLs, and it derives that the result is insensitive to the choice of initial values for the iteration. Experimental results show that the proposed method is more efficient than the relevance-judgment methods of traditional web crawler algorithms.
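The abstract does not spell out the link model, so the following is only a rough sketch of the general idea it describes: relevance spreads from labeled seed URLs to unlabeled URLs over a link graph, and the iteration settles on the same scores whatever initial value is used. It uses a standard label-propagation update (each unlabeled URL takes the weighted average of its neighbours' scores), which is an assumption and not necessarily the authors' algorithm. The function name `propagate_relevance`, the page names, and the link weights are all hypothetical.

```python
def propagate_relevance(links, seeds, init=0.5, tol=1e-9, max_iter=10_000):
    """Spread relevance scores from labeled seeds to unlabeled URLs.

    links: dict mapping each URL to {neighbour URL: link weight} (symmetric).
    seeds: dict mapping labeled URLs to a fixed relevance score in [0, 1].
    init:  starting score given to every unlabeled URL.
    """
    scores = {url: seeds.get(url, init) for url in links}
    for _ in range(max_iter):
        largest_change = 0.0
        for url, neighbours in links.items():
            if url in seeds:
                continue  # labeled seeds keep their known score
            total_weight = sum(neighbours.values())
            if total_weight == 0:
                continue  # isolated URL: nothing to propagate from
            # Weighted average of the neighbours' current scores.
            updated = sum(w * scores[v] for v, w in neighbours.items()) / total_weight
            largest_change = max(largest_change, abs(updated - scores[url]))
            scores[url] = updated
        if largest_change < tol:
            break
    return scores


# Hypothetical toy graph: two labeled seeds and three unlabeled pages.
links = {
    "seed_on_topic": {"page_a": 1.0},
    "seed_off_topic": {"page_b": 1.0},
    "page_a": {"seed_on_topic": 1.0, "page_b": 1.0, "page_c": 1.0},
    "page_b": {"page_a": 1.0, "seed_off_topic": 1.0},
    "page_c": {"page_a": 1.0},
}
seeds = {"seed_on_topic": 1.0, "seed_off_topic": 0.0}

# Two very different starting values for the unlabeled pages.
low = propagate_relevance(links, seeds, init=0.0)
high = propagate_relevance(links, seeds, init=1.0)

for url in ("page_a", "page_b", "page_c"):
    print(url, round(low[url], 4), round(high[url], 4))
```

In this toy graph both runs settle on the same scores because every unlabeled page is connected, directly or indirectly, to a clamped seed, so the averaging update has a single fixed point. A crawler could then visit unlabeled URLs in descending score order; how the paper actually weights links and ranks candidates is not stated in the abstract.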

Key words: web spider, topic-oriented search engine, search strategy, Vector Space Model (VSM)