Discovery of topic-specific information source based on web crawler and website classification

Abstract

The discovery of topic-specific information sources is a prerequisite for topic-oriented Web information integration. A method for discovering such sources is presented that recasts the problem as website topic classification and uses external links to discover new sources. Content and structure feature terms that reflect a website's topic are extracted from the site to build an improved vector space model (VSM) describing that topic. Based on this model, websites are classified by topic using a combination of the center-vector algorithm and SVM. A web search strategy that minimizes the number of crawled pages is presented; it crawls the pages that best represent a website's topic while discovering external links. The method was applied to the discovery of forestry business information sources, and experiments verified its effectiveness.

Keywords: website topic, feature description, classification, crawler, information source discovery
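The classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how content and structure terms are weighted or how the center-vector algorithm and SVM are combined, so the weight `STRUCTURE_WEIGHT`, the thresholds `low` and `high`, and the idea of deferring only ambiguous sites to a second-stage classifier are all assumptions made for illustration. The SVM itself is represented by a callable passed in as `second_stage`.

```python
import math
from collections import Counter

# Hypothetical weight: how much more a structure term (e.g. title or
# navigation text) counts than a body-content term. Not from the paper.
STRUCTURE_WEIGHT = 2.0


def site_vector(content_terms, structure_terms, structure_weight=STRUCTURE_WEIGHT):
    """Build a sparse term-weight vector from both kinds of feature terms."""
    vec = Counter()
    for term in content_terms:
        vec[term] += 1.0
    for term in structure_terms:
        vec[term] += structure_weight
    return dict(vec)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Center vector of a class: the term-wise mean of its training vectors."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    count = len(vectors)
    return {term: weight / count for term, weight in total.items()}


def classify(vec, center, second_stage, low=0.2, high=0.6):
    """Accept or reject by similarity to the center vector.

    Sites scoring between the assumed thresholds `low` and `high` are
    passed to `second_stage`, a callable standing in for a trained SVM.
    """
    score = cosine(vec, center)
    if score >= high:
        return True
    if score < low:
        return False
    return second_stage(vec)


# Toy training set for a forestry-business topic (made-up terms).
training = [
    site_vector(["timber", "price", "timber"], ["forestry", "market"]),
    site_vector(["seedling", "price"], ["forestry", "trade"]),
]
center = centroid(training)
candidate = site_vector(["timber", "price"], ["forestry"])
is_on_topic = classify(candidate, center, second_stage=lambda v: False)
```

In this sketch the center vector handles the clear cases cheaply, and the second-stage classifier is consulted only when the similarity falls in the band between the two thresholds.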


DENG Houping, WU Gang. Discovery of topic-specific information source based on web crawler and website classification[J]. Computer Engineering and Applications, 2016, 52(3): 59-65.

邓厚平，武刚. 基于爬虫和网站分类的主题信息源发现方法[J]. 计算机工程与应用, 2016, 52(3): 59-65.

