Computer Engineering and Applications, 2012, Vol. 48, Issue (11): 82-87.

NQPC: novel query log-based web-page classification method

LIU Xiangtao (刘祥涛)¹,², LIU Shuliang (刘书良)³

  1. Guangdong Electronics Industry Institute, Dongguan, Guangdong 523808, China
  2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  3. IZP Technology Co., Ltd., Beijing 100081, China
  • Online: 2012-04-11 Published: 2012-04-16

Abstract: Web-page classification sorts massive collections of web pages into categories and is useful in many applications. Many automatic web-page classification methods exist, but the commonly used content-based methods leave considerable room for improvement because page content is noisy. This paper proposes NQPC (Novel Query log-based web-Page Classification), a web-page classification method based on query logs. Its contributions are threefold: a low-dimensional feature vector extraction method that avoids the “curse of dimensionality”; classification based on high-quality query logs, whose content is cleaner than that of the web pages themselves; and a filtering method that improves classification accuracy. Experimental results show that NQPC achieves excellent performance, which gives it good application prospects.

Key words: query log, web-page classification, machine learning, text classification, feature extraction
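
The abstract does not give the details of NQPC's feature extraction or filtering steps, so the following is only an illustrative sketch of the general idea of query log-based classification, not the authors' algorithm. It assumes a hypothetical click log of (query, clicked URL) pairs and a hypothetical query-to-category mapping. Each page gets one feature per category (hence a low-dimensional vector), and a simple confidence threshold stands in for the filter. All names and the `min_share` parameter are assumptions made for illustration.

```python
from collections import defaultdict


def build_page_vectors(click_log, query_category, categories):
    """Build one low-dimensional vector per URL: the share of clicks per category.

    click_log      -- iterable of (query, clicked_url) pairs (assumed log format)
    query_category -- dict mapping a query string to a category (assumed to exist)
    categories     -- ordered list of category names; one vector dimension each
    """
    index = {name: i for i, name in enumerate(categories)}
    counts = defaultdict(lambda: [0] * len(categories))
    for query, url in click_log:
        category = query_category.get(query)
        if category is not None:  # skip queries with no known category
            counts[url][index[category]] += 1
    # Normalize counts to shares; every stored row has at least one click.
    return {url: [n / sum(row) for n in row] for url, row in counts.items()}


def classify(vectors, categories, min_share=0.6):
    """Label each URL with its dominant category, dropping ambiguous pages.

    The threshold is a stand-in for a filtering step: a page is labeled only
    if its top category accounts for at least min_share of its clicks.
    """
    labels = {}
    for url, vec in vectors.items():
        best = max(range(len(vec)), key=vec.__getitem__)
        if vec[best] >= min_share:
            labels[url] = categories[best]
    return labels
```

The vector length here equals the number of categories rather than the vocabulary size, which is one simple way to keep the feature space small. The feature definition and filter actually used by NQPC may differ.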
