Computer Engineering and Applications ›› 2011, Vol. 47 ›› Issue (29): 23-26.

• Doctoral Forum •

Improved crawler algorithm technique for P2P specific information

DING Junping, CAI Wandong

  1. College of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2011-10-11 Published: 2011-10-11

面向P2P特定信息的爬虫改进技术

丁军平,蔡皖东   

  1. 西北工业大学 计算机学院,西安 710072

Abstract: Existing topic crawlers fetch large numbers of irrelevant web pages while obtaining “meta-information”. To address this problem, the topic crawler technique is improved by adding a URL classification algorithm. Based on supplied URL samples, the algorithm generates several keyword sets for irrelevant URLs as well as keyword sets for “meta-information” URLs. It assigns a weight to each keyword in a set and a classification threshold to each set, represents each URL as a feature vector, and classifies the URL by computing its distance to the keyword sets. The performance of the algorithm is analyzed in detail. Experimental results show that, compared with the traditional topic crawler technique, the improved technique greatly raises the efficiency of “meta-information” acquisition: the quantity of “meta-information” obtained in the same amount of time increases by 96.21%, which fully meets the performance requirements that the active monitoring model places on the crawler.

Key words: “meta-information” acquisition, topic crawler technique, URL classification algorithm, feature vector representation, active monitoring model
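
The classification step described in the abstract (weighted keyword sets, a per-set threshold, a feature vector per URL, and a distance comparison) can be illustrated with a minimal Python sketch. The abstract does not specify the distance metric, the tokenization, or any concrete keywords, so everything below is an assumption made for illustration: the names `KeywordSet`, `tokenize`, `feature_vector`, `distance` and `classify` are hypothetical, the distance is taken to be one minus the matched share of the set's total weight, and the sample keywords and URLs are invented.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class KeywordSet:
    label: str                 # hypothetical label, e.g. "meta" or "irrelevant"
    weights: Dict[str, float]  # keyword -> weight
    threshold: float           # largest distance at which a URL still matches


def tokenize(url: str) -> set:
    """Split a URL into lowercase alphanumeric tokens (assumed tokenization)."""
    return {t for t in re.split(r"[^a-z0-9]+", url.lower()) if t}


def feature_vector(url: str, kset: KeywordSet) -> List[int]:
    """Binary vector: 1 where the URL contains the set's keyword, else 0."""
    tokens = tokenize(url)
    return [1 if kw in tokens else 0 for kw in kset.weights]


def distance(url: str, kset: KeywordSet) -> float:
    """Assumed metric: 1 - (matched weight / total weight), in [0, 1]."""
    vec = feature_vector(url, kset)
    total = sum(kset.weights.values())
    matched = sum(w for bit, w in zip(vec, kset.weights.values()) if bit)
    return 1.0 - matched / total if total else 1.0


def classify(url: str, ksets: List[KeywordSet]) -> Optional[str]:
    """Return the label of the closest set within its threshold, else None."""
    best = None
    for kset in ksets:
        d = distance(url, kset)
        if d <= kset.threshold and (best is None or d < best[0]):
            best = (d, kset.label)
    return best[1] if best else None


# Invented sample data, not taken from the paper.
meta = KeywordSet("meta", {"torrent": 2.0, "download": 1.0, "seed": 1.0}, 0.5)
irrelevant = KeywordSet("irrelevant", {"login": 1.0, "forum": 1.0, "help": 1.0}, 0.7)

print(classify("http://example.com/download/file.torrent", [meta, irrelevant]))
print(classify("http://example.com/forum/help", [meta, irrelevant]))
```

In a crawler built along these lines, a URL classified into an irrelevant set would be dropped before it is fetched, which is where the time saving reported in the abstract would come from. How the paper itself derives the keyword sets from URL samples is not stated in the abstract and is not modeled here.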

摘要: 针对现有主题爬虫技术在获取“元信息”时会抓取大量不相关网页的问题,对现有主题爬虫技术进行改进,加入了URL分类技术。该分类方法根据提供的URL样本信息,生成多个不相关URL关键词集合以及“元信息”URL关键词集合;对集合中的关键词设置权限信息,设置集合的分类判断阈值;将URL使用特征向量表示,计算与关键词集合的距离,对URL进行分类;对算法性能进行了详细分析。实验结果表明,所提方法在进行“元信息”获取时,与传统主题爬虫技术相比能够大幅度提高效率,在相同时间内,“元信息”获取数量可增加96.21%,完全能够满足主动监测模型对网络爬虫的性能要求。

关键词: “元信息”获取, 主题爬虫技术, URL分类算法, 特征向量表示, 主动监测模型