Computer Engineering and Applications ›› 2007, Vol. 43 ›› Issue (24): 169-171.

• Database and Information Processing •

Application of combined dimensionality reduction techniques to Chinese Web page categorization

李新福   

  1. College of Mathematics and Computer, Hebei University, Baoding, Hebei 071002
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2007-08-21 Published: 2007-08-21
  • Contact: 李新福

Web page categorization based on LSA and features selection

LI Xin-fu   

  1. College of Mathematics and Computer, Hebei University, Baoding, Hebei 071002, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2007-08-21 Published: 2007-08-21
  • Contact: LI Xin-fu

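Abstract:

In text categorization based on the vector space model, feature vectors are extremely sparse and high-dimensional, so classification efficiency can be improved only by reducing the dimensionality of the vector space. Starting from statistical feature selection, which gives a first reduction of the feature space, the method applies latent semantic analysis to mine the semantic information among document features and uses singular value decomposition of the matrix to reduce the dimensionality of the feature space further. Experimental results show that the macro-averaged F1 of the classification results improves by about 5%, which confirms the effectiveness of the method.

Key words: Web page categorization, latent semantic analysis, feature selection, KNN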
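
The two-stage reduction described in the abstract above (statistical feature selection, then latent semantic analysis via singular value decomposition) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the abstract does not name the selection statistic, so chi-square is assumed here, and the function names `select_by_chi2`, `lsa_fit`, and `lsa_transform`, the document-by-term matrix layout, and the toy data are all hypothetical.

```python
import numpy as np

def select_by_chi2(X, y, k):
    """Return sorted indices of the k terms with the highest chi-square
    score (maximum over classes). ASSUMPTION: chi-square stands in for the
    unspecified "statistical method" of the abstract."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    present = (X > 0).astype(float)      # 1 where a term occurs in a document
    n_docs = X.shape[0]
    scores = np.zeros(X.shape[1])
    for label in np.unique(y):
        in_class = (y == label)
        n11 = present[in_class].sum(axis=0)    # in class, term present
        n10 = present[~in_class].sum(axis=0)   # out of class, term present
        n01 = in_class.sum() - n11             # in class, term absent
        n00 = (~in_class).sum() - n10          # out of class, term absent
        num = n_docs * (n11 * n00 - n10 * n01) ** 2
        den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
        chi2 = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
        scores = np.maximum(scores, chi2)
    top_k = np.argsort(scores)[::-1][:k]
    return np.sort(top_k)

def lsa_fit(X, rank):
    """Truncated SVD of a document-by-term matrix X = U S V^T.
    Returns the (terms x rank) matrix of leading right singular vectors."""
    _, _, vt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    return vt[:rank].T

def lsa_transform(X, V):
    """Project documents into the rank-dimensional latent space."""
    return np.asarray(X, dtype=float) @ V

# Toy data (hypothetical): 4 documents, 5 terms, 2 classes.
X = np.array([[2, 0, 1, 0, 1],
              [1, 0, 2, 0, 1],
              [0, 3, 0, 1, 1],
              [0, 1, 0, 2, 1]])
y = np.array([0, 0, 1, 1])

keep = select_by_chi2(X, y, k=4)   # stage 1: drop the uninformative term
V = lsa_fit(X[:, keep], rank=2)    # stage 2: learn a 2-dimensional latent space
Z = lsa_transform(X[:, keep], V)   # reduced document vectors
```

In this toy example the fifth term occurs in every document, so it carries no class information and is the one discarded in the first stage; the SVD step then maps the remaining four term dimensions down to two.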

Abstract: The feature vectors of Chinese Web pages used in text categorization are high-dimensional and very sparse. Reducing the dimensionality of the feature space is therefore a key problem for practical text classification. This paper describes a new method that combines feature selection based on statistical methods with latent semantic analysis. The K-Nearest Neighbor method is used as the evaluating classifier. The experimental results show that the proposed method is promising for Chinese Web page categorization.

Key words: Web page categorization, latent semantic analysis, feature selection, KNN
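
The evaluation setup named in the abstract (a K-Nearest Neighbor classifier scored by macro-averaged F1) might look like the minimal sketch below. The cosine similarity measure, the majority vote, the default `k=3`, and the toy vectors and labels are assumptions made for illustration; the abstract does not state the paper's actual settings.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k most similar training vectors.
    ASSUMPTION: cosine similarity and an unweighted vote."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: cosine(pair[0], x),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1_scores.append(2 * precision * recall / total if total else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy data (hypothetical): 2-dimensional reduced vectors, two categories.
train_X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_y = ["sports", "sports", "finance", "finance"]

predicted = knn_predict(train_X, train_y, [0.8, 0.2], k=3)
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Macro averaging gives every category the same weight regardless of how many pages it contains, which is why it is a common choice when category sizes are uneven.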