Research on search strategy of web spider in topic-oriented search engines

Computer Engineering and Applications, 2014, Vol. 50, Issue 2: 116-119.

• Databases, Data Mining, and Machine Learning •

SHI Baoming (史宝明)¹, HE Yuanxiang (贺元香)¹, WU Chongzheng (吴崇正)²

1. School of Electronics and Information Engineering, Lanzhou University of Arts and Science, Lanzhou 730000, China
2. School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China

Online: 2014-01-15  Published: 2014-01-26

Abstract

Abstract: A focused crawler selects the most valuable links to visit, so keeping the search centered on a given topic is a key problem. Traditional focused crawlers compute only the relevance of each link in isolation and ignore the relevance relationships among the URLs still awaiting analysis, which lowers crawling efficiency. To address this, the paper proposes a relevance-discrimination algorithm based on a link model that combines labeled seed URLs with unlabeled candidate URLs to judge the relevance of the unlabeled URLs, and it derives that the result is insensitive to the choice of initial iteration values. Experimental results show that the proposed method is more efficient than the relevance-discrimination methods used by traditional web crawler algorithms.

Key words: web spider, topic-oriented search engine, search strategy, vector space model (VSM)
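The abstract does not spell out the link model or its update rule. As a rough illustration only of how labeled seed URLs and unlabeled URLs can be combined iteratively, the sketch below uses a generic label-propagation scheme over a link graph. That choice is an assumption and not the authors' algorithm; the function name `propagate_relevance`, the graph, and the URL names are all hypothetical. Seed scores are held fixed and every unlabeled URL repeatedly takes the mean score of its linked neighbours.

```python
def propagate_relevance(links, seeds, init=0.5, iterations=100):
    """Generic label propagation over a link graph (illustrative assumption).

    links: dict mapping a URL to the URLs it links to (treated as undirected).
    seeds: dict mapping labeled seed URLs to a fixed relevance in [0, 1].
    init:  starting score given to every unlabeled URL.
    """
    # Build a symmetric neighbour map from the link lists.
    neighbours = {}
    for src, dsts in links.items():
        neighbours.setdefault(src, set())
        for dst in dsts:
            neighbours[src].add(dst)
            neighbours.setdefault(dst, set()).add(src)

    # Seeds start at their label, everything else at the initial value.
    scores = {url: seeds.get(url, init) for url in neighbours}

    for _ in range(iterations):
        updated = {}
        for url, nbrs in neighbours.items():
            if url in seeds:
                updated[url] = seeds[url]  # labeled URLs stay clamped
            elif nbrs:
                # Unlabeled URL: average of its neighbours' current scores.
                updated[url] = sum(scores[n] for n in nbrs) / len(nbrs)
            else:
                updated[url] = scores[url]  # isolated URL keeps its score
        scores = updated
    return scores


# Hypothetical chain: seed_on - a - b - seed_off
example_links = {"seed_on": ["a"], "a": ["b"], "b": ["seed_off"]}
example_seeds = {"seed_on": 1.0, "seed_off": 0.0}

from_zero = propagate_relevance(example_links, example_seeds, init=0.0)
from_one = propagate_relevance(example_links, example_seeds, init=1.0)
print(from_zero["a"], from_one["a"])
print(from_zero["b"], from_one["b"])
```

In this toy chain both runs settle near 2/3 for `a` and 1/3 for `b` whether the unlabeled URLs start at 0 or at 1, which mirrors the kind of insensitivity to initial values that the abstract mentions. In this sketch the property holds only for URLs connected to at least one seed.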

SHI Baoming, HE Yuanxiang, WU Chongzheng. Research on search strategy of web spider in topic-oriented search engines[J]. Computer Engineering and Applications, 2014, 50(2): 116-119.

