Discovery of topic-specific information source based on web crawler and website classification

Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (3): 59-65.

DENG Houping, WU Gang

College of Information, Beijing Forestry University, Beijing 100083, China

Online: 2016-02-01 Published: 2016-02-03

Abstract

Abstract: Discovering topic-specific information sources is a prerequisite for topic-specific Web information integration. This paper presents a discovery method that recasts information source discovery as a website topic classification problem and follows external links to find new information sources. Content feature terms and structure feature terms that reflect a website's topic are extracted from the site, and an improved vector space model (VSM) is built to describe the website topic. Based on this model, website topics are classified by combining the class-center vector method with SVM. A web search strategy that crawls as few pages as possible is also proposed; it crawls the pages that best represent a website's topic while discovering external links. The method is applied to forestry business information sources, and experiments verify its effectiveness.

Key words: website topic, feature description, classification, crawler, information source discovery
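
As a rough illustration of the classification idea in the abstract, the sketch below builds a weighted term vector from a site's content and structure terms and assigns the site to the nearest class center by cosine similarity. It is a minimal sketch under stated assumptions, not the authors' implementation: the weights `CONTENT_WEIGHT` and `STRUCTURE_WEIGHT`, the function names, and the sample terms are all hypothetical, and the SVM stage is left out because the abstract does not say how the two classifiers are combined.

```python
import math
from collections import Counter

# Hypothetical weights (not from the paper): how much a structure term
# counts relative to a content term when describing a website.
CONTENT_WEIGHT = 1.0
STRUCTURE_WEIGHT = 2.0


def site_vector(content_terms, structure_terms):
    """Build a sparse weighted term vector for one website."""
    vec = Counter()
    for term in content_terms:
        vec[term] += CONTENT_WEIGHT
    for term in structure_terms:
        vec[term] += STRUCTURE_WEIGHT
    return dict(vec)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def class_center(vectors):
    """Average a list of sparse vectors into one class-center vector."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {t: w / len(vectors) for t, w in total.items()}


def classify(vec, centers):
    """Return the label of the class center most similar to vec."""
    return max(centers, key=lambda label: cosine(vec, centers[label]))


# Made-up training sites for two classes (illustrative data only).
forestry = [
    site_vector(["timber", "seedling", "price"], ["forestry", "market"]),
    site_vector(["timber", "lumber", "supply"], ["forestry", "trade"]),
]
other = [
    site_vector(["movie", "music", "star"], ["entertainment", "news"]),
    site_vector(["match", "score", "team"], ["sports", "news"]),
]
centers = {"forestry": class_center(forestry), "other": class_center(other)}

candidate = site_vector(["timber", "price", "supply"], ["forestry"])
print(classify(candidate, centers))  # prints "forestry"
```

In a combined scheme, one plausible design (again an assumption) would be to let the class-center similarity settle clear cases cheaply and pass only borderline sites to an SVM.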

DENG Houping, WU Gang. Discovery of topic-specific information source based on web crawler and website classification[J]. Computer Engineering and Applications, 2016, 52(3): 59-65.
