NQPC: A Novel Query Log-Based Web-Page Classification Method

Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (11): 82-87.

NQPC: novel query log-based web-page classification method

LIU Xiangtao1,2, LIU Shuliang3

1. Guangdong Electronics Industry Institute, Dongguan, Guangdong 523808, China
2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
3. IZP Technology Co., Ltd., Beijing 100081, China

Online: 2012-04-11 Published: 2012-04-16

Abstract

Abstract: Web-page classification organizes massive numbers of web pages into categories and has many applications. Many automatic web-page classification methods exist; among them, the commonly used content-based methods leave considerable room for performance improvement because page content is impure. This paper proposes NQPC (Novel Query log-based web-Page Classification), a new web-page classification method based on query logs. Its contributions are threefold: a low-dimensional feature vector extraction method that avoids the “curse of dimensionality”; classification based on high-quality query logs, whose content is purer than web-page content; and a filtering method that improves classification accuracy. Experimental results show that the proposed method achieves excellent performance, which gives it good application prospects.

Key words: query log, web-page classification, machine learning, text classification, feature extraction
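
The abstract names three components (a low-dimensional feature vector, classification from query logs, and a filtering step) without giving their definitions. The following Python sketch only illustrates the general idea under stated assumptions and is not the authors' NQPC algorithm: the category lexicon `CATEGORY_TERMS`, the functions `group_by_url`, `page_features`, and `classify`, and the `min_confidence` threshold are all hypothetical names chosen for this example.

```python
from collections import Counter, defaultdict

# Hypothetical hand-written lexicon: one small term set per category.
# A real system would derive such terms from labeled data.
CATEGORY_TERMS = {
    "sports": {"football", "nba", "score"},
    "finance": {"stock", "bank", "loan"},
    "travel": {"hotel", "flight", "visa"},
}


def group_by_url(log):
    """Group (query, clicked_url) log records into url -> list of queries."""
    pages = defaultdict(list)
    for query, url in log:
        pages[url].append(query)
    return pages


def page_features(queries):
    """Build a low-dimensional vector: one normalized score per category.

    The dimension equals the number of categories rather than the
    vocabulary size, which is the assumed sense of "low-dimensional" here.
    """
    counts = Counter()
    for query in queries:
        for term in query.lower().split():
            for category, terms in CATEGORY_TERMS.items():
                if term in terms:
                    counts[category] += 1
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORY_TERMS}


def classify(queries, min_confidence=0.6):
    """Return the best-scoring category, or None if the filter rejects it."""
    features = page_features(queries)
    best = max(features, key=features.get)
    # Assumed filtering rule: drop pages whose top score is not dominant.
    if features[best] < min_confidence:
        return None
    return best


if __name__ == "__main__":
    log = [
        ("nba score", "a.example"),
        ("football score today", "a.example"),
        ("cheap hotel", "b.example"),
        ("hotel flight stock bank", "d.example"),
    ]
    for url, queries in group_by_url(log).items():
        print(url, classify(queries))
```

In this sketch a page whose queries split evenly between two categories is left unlabeled rather than forced into one, which is one simple way a filter could trade coverage for accuracy.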

LIU Xiangtao, LIU Shuliang. NQPC: novel query log-based web-page classification method[J]. Computer Engineering and Applications, 2012, 48(11): 82-87.
