Computer Engineering and Applications, 2010, Vol. 46, Issue (16): 63-66. DOI: 10.3778/j.issn.1002-8331.2010.16.018

Focused crawler based on one-class document classification

FANG Jia-pei, HUANG Zhan

  1. Department of Computer Science, Jinan University, Guangzhou 510632, China
  • Received: 2009-03-31 Revised: 2009-05-18 Online: 2010-06-01 Published: 2010-06-01
方加沛,黄 战   

  1. 暨南大学 计算机科学系,广州 510632
Abstract: When designing a focused crawler, the topic of interest can be specified in two ways: by setting keywords manually or by constructing a classifier. The former is easy to implement, but it depends on expert experience and suffers from missing keywords and imprecise quantification of keyword weights. The main drawback of the latter is that representative negative training examples are difficult to acquire. To address these problems, a focused crawler based on one-class document classification is proposed. The classification is applied not only to the content of web documents but also to the anchor text of hyperlinks. The experimental results show that the proposed focused crawler is feasible.

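Abstract (translated from the Chinese): In focused crawler design, the topic can be established either by manually setting a keyword set or by constructing a classifier. The former is easy to implement but depends on expert experience and suffers from missing keywords and imprecisely quantified weights; the main drawback of the latter is the difficulty of obtaining representative negative training samples. To address this, a focused crawler based on one-class document classification is proposed, and the classification is also applied to the anchor text of hyperlinks. The experimental results fully demonstrate the feasibility of this focused crawler.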
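
The abstract does not say which one-class classifier the authors use, so the following is only a minimal sketch of the general idea, under stated assumptions: a classifier is trained on on-topic (positive) documents alone and then scores both page content and hyperlink anchor text. The centroid-plus-cosine-similarity model, the class name OneClassCentroid, the function link_priority, the anchor_weight parameter, and the sample documents are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def unit_vector(text):
    """Term-frequency vector of the text, scaled to unit length."""
    counts = Counter(tokenize(text))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


class OneClassCentroid:
    """Hypothetical one-class model: learns from positive documents only."""

    def fit(self, positive_docs):
        # Assumes at least one non-empty positive document.
        vectors = [unit_vector(d) for d in positive_docs]
        total = Counter()
        for vec in vectors:
            total.update(vec)  # sum the unit vectors term by term
        norm = math.sqrt(sum(w * w for w in total.values()))
        self.centroid = {t: w / norm for t, w in total.items()}
        # Assumed decision rule: accept anything at least as similar to
        # the centroid as the least typical training document.
        self.threshold = min(cosine(v, self.centroid) for v in vectors)
        return self

    def score(self, text):
        return cosine(unit_vector(text), self.centroid)

    def is_on_topic(self, text):
        return self.score(text) >= self.threshold


def link_priority(clf, page_text, anchor_text, anchor_weight=0.5):
    """Blend the parent page score and the anchor text score (assumed rule)."""
    page_part = (1 - anchor_weight) * clf.score(page_text)
    return page_part + anchor_weight * clf.score(anchor_text)


positives = [
    "focused crawler downloads web pages about a chosen topic",
    "a topic crawler follows hyperlinks to relevant web pages",
    "the crawler ranks hyperlinks by topic relevance",
]
clf = OneClassCentroid().fit(positives)
print(clf.is_on_topic("banana bread recipe with walnuts"))
print(link_priority(clf, positives[0], "topic crawler"))
```

No negative examples appear anywhere in the training step, which is the property the abstract emphasizes. A crawler built this way could order its frontier by link_priority so that links with on-topic anchor text on on-topic pages are fetched first.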

