Computer Engineering and Applications ›› 2015, Vol. 51 ›› Issue (24): 120-125.

• Network, Communication, and Security •

Expected residual energy based focused crawling method

YIN Wenke, ZONG Shiqiang, WANG Heng

  1. Science and Technology on Information Systems Engineering Laboratory, The 28th Research Institute, China Electronics Technology Group Corporation (CETC), Nanjing 210007, China
  • Online: 2015-12-15 Published: 2015-12-30

Abstract: Determining the search direction and depth is the core problem of focused crawling. To address it, this paper proposes the concept of the expected residual energy of a hyperlink, together with a method for computing it. The method uses the information on the current web page to calculate the immediate return energy of each hyperlink, and then iteratively updates the link's expected residual energy using the historical return knowledge contributed by the different historical paths that reach the same link. Using the expected residual energy as both the link priority and the search depth limit, the paper presents a focused crawler based on the expected residual energy model, covering its crawling algorithm and system architecture, and gives the implementation of the key modules. Experimental results show that the method has a stronger ability to discover topic-relevant websites.

Key words: focused crawling, search direction, search depth, topic relevance, expected residual energy
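
The abstract names the ingredients of the method but not its formulas, so the following Python sketch is only an illustration of how such a frontier might be organised, not the authors' algorithm. It assumes that the energy a link earns immediately from the current page is supplied by the caller as `immediate_reward`, that a link inherits a decayed share of its parent's energy, and that estimates from different paths to the same link are combined by a running mean. The class name `EnergyFrontier`, the parameters `decay` and `min_energy`, and the update rule itself are all hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass
class LinkState:
    energy: float = 0.0  # current estimate of the link's expected residual energy
    paths: int = 0       # number of historical paths that have reached this link


class EnergyFrontier:
    """Crawl frontier ordered by a per-link energy estimate (illustrative only)."""

    def __init__(self, decay=0.5, min_energy=0.1):
        self.decay = decay            # assumed: share of parent energy passed to a link
        self.min_energy = min_energy  # assumed: cutoff that acts as the depth limit
        self.state = {}               # url -> LinkState
        self.heap = []                # (-energy, url); may contain stale entries
        self.visited = set()          # urls already handed out by pop()

    def offer(self, url, immediate_reward, parent_energy):
        """Record one more path reaching `url` and refresh its energy estimate."""
        sample = immediate_reward + self.decay * parent_energy
        s = self.state.setdefault(url, LinkState())
        s.paths += 1
        # Running mean over all paths seen so far (an assumed update rule).
        s.energy += (sample - s.energy) / s.paths
        # Low-energy links are never queued, which bounds how deep the crawl goes.
        if url not in self.visited and s.energy >= self.min_energy:
            heapq.heappush(self.heap, (-s.energy, url))

    def pop(self):
        """Return (url, energy) for the highest-energy unvisited link, or None."""
        while self.heap:
            neg_energy, url = heapq.heappop(self.heap)
            s = self.state[url]
            # Skip entries that were superseded by a later update or already crawled.
            if url in self.visited or -neg_energy != s.energy:
                continue
            self.visited.add(url)
            return url, s.energy
        return None
```

In a crawler built around this sketch, each fetched page would call `offer` once per outgoing link, passing its own energy as `parent_energy`, and the main loop would call `pop` to choose the next URL. Because an exhausted path keeps shrinking by the factor `decay` unless pages along it earn new reward, the `min_energy` cutoff plays the role of a search depth limit.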