Focused crawler based on one-class document classification

doi:10.3778/j.issn.1002-8331.2010.16.018

Computer Engineering and Applications, 2010, Vol. 46, Issue (16): 63-66. DOI: 10.3778/j.issn.1002-8331.2010.16.018

• Research and Development, Design, Testing •

FANG Jia-pei, HUANG Zhan

Department of Computer Science, Jinan University, Guangzhou 510632, China

Received: 2009-03-31 Revised: 2009-05-18 Online: 2010-06-01 Published: 2010-06-01
Contact: FANG Jia-pei

Abstract

In designing a focused crawler, the topic of interest can be specified in two ways: by setting keywords manually or by constructing a classifier. Although the former is easy to implement, it depends on expert experience and suffers from missing keywords and imprecise quantification of keyword weights. The major defect of the latter is that representative negative training examples are difficult to acquire. To address these problems, a focused crawler based on one-class document classification is proposed. The classifier is applied not only to the content of web documents but also to the anchor text of hyperlinks. Experimental results show that the proposed focused crawler is feasible.
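The abstract does not say which one-class classifier the authors use, so the following is only a minimal sketch of the general idea, assuming a simple centroid model trained on positive (on-topic) examples alone. The class name `CentroidOneClassClassifier`, the helper `rank_links`, the threshold of 0.2, and the example URLs are all hypothetical and are not taken from the paper. The same scoring function is applied both to page content and to hyperlink anchor text, the latter being used to order the crawl frontier.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def normalize(counts):
    """Scale a term-weight mapping to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {t: v / norm for t, v in counts.items()} if norm else {}


class CentroidOneClassClassifier:
    """Hypothetical stand-in: trained on positive (on-topic) examples only."""

    def __init__(self, threshold=0.2):
        self.threshold = threshold  # assumed cut-off, not from the paper
        self.centroid = {}

    def fit(self, positive_docs):
        # Sum the unit-length term vectors of all positive documents,
        # then normalize the sum to obtain the topic centroid.
        total = Counter()
        for doc in positive_docs:
            for term, weight in normalize(Counter(tokenize(doc))).items():
                total[term] += weight
        self.centroid = normalize(total)
        return self

    def score(self, text):
        # Cosine similarity between the text and the topic centroid.
        vec = normalize(Counter(tokenize(text)))
        return sum(w * self.centroid.get(t, 0.0) for t, w in vec.items())

    def is_on_topic(self, text):
        return self.score(text) >= self.threshold


def rank_links(classifier, links):
    """Order (url, anchor_text) pairs by anchor-text score, best first."""
    heap = [(-classifier.score(anchor), url) for url, anchor in links]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


if __name__ == "__main__":
    clf = CentroidOneClassClassifier().fit([
        "focused crawler downloads topic relevant web pages",
        "web crawler topic classifier ranks hyperlinks",
    ])
    links = [
        ("http://example.com/a", "cheap flights and hotels"),
        ("http://example.com/b", "topic crawler for web pages"),
    ]
    print(rank_links(clf, links))
```

Because only positive examples are needed for `fit`, a sketch like this sidesteps the negative-sample problem described above; a real system would likely use stronger features and a tuned threshold.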

CLC Number:

TP311

FANG Jia-pei, HUANG Zhan. Focused crawler based on one-class document classification[J]. Computer Engineering and Applications, 2010, 46(16): 63-66.
