Computer Engineering and Applications ›› 2009, Vol. 45 ›› Issue (34): 121-123. DOI: 10.3778/j.issn.1002-8331.2009.34.037

• Database, Signal and Information Processing •

Feature selection method combining optimized document frequency with LSA

ZHU Hao-dong1,2, ZHONG Yong1,2

  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu 610041, China
  2. Graduate University of Chinese Academy of Sciences, Beijing 100039, China
  • Received: 2008-12-09 Revised: 2009-02-27 Online: 2009-12-01 Published: 2009-12-01
  • Contact: ZHU Hao-dong

结合优化的文档频和LSA的特征选择方法

朱颢东1,2,钟 勇1,2   


Abstract: To improve the efficiency and accuracy of text categorization algorithms, a feature selection algorithm must be used to reduce the dimensionality of the feature space. However, many common feature selection algorithms select features by their weights alone and do not consider the hidden relationships among features, so the selected feature subset contains some redundancy and is not well representative. This paper first presents a document frequency method based on minimum word frequency and uses it to filter out some terms, which reduces the sparsity of the text matrix. LSA is then used to analyze the semantics among words and to eliminate the influence of synonyms and polysemous words. The combined method improves the speed and accuracy of text categorization. The experimental results show that the combined method is promising.

Key words: text categorization, word frequency, document frequency, Latent Semantic Analysis (LSA)
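
The filtering step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definition of the optimized document frequency, so the reading used here is an assumption: a document counts toward a term's document frequency only when the term occurs at least `min_tf` times in that document, and a term is kept when its document frequency lies within a chosen range. The function names, parameters, thresholds, and toy documents are all hypothetical and are not taken from the paper. The LSA step is not shown; it would typically apply a truncated singular value decomposition to the filtered term-document matrix.

```python
from collections import Counter


def filter_terms_by_document_frequency(docs, min_tf=2, min_df=2, max_df_ratio=1.0):
    """Return the sorted terms that pass a document-frequency filter.

    Assumed interpretation (not confirmed by the paper): a document adds to a
    term's document frequency only if the term appears at least `min_tf`
    times in that document.
    """
    df = Counter()
    for tokens in docs:
        tf = Counter(tokens)  # word frequency inside this document
        for term, count in tf.items():
            if count >= min_tf:
                df[term] += 1
    max_df = max_df_ratio * len(docs)
    return sorted(term for term, n in df.items() if min_df <= n <= max_df)


def term_document_matrix(docs, vocabulary):
    """Build a dense term-by-document count matrix; rows follow `vocabulary`."""
    counts = [Counter(tokens) for tokens in docs]
    return [[c[term] for c in counts] for term in vocabulary]


# Toy corpus (hypothetical), already tokenized.
docs = [
    "car car engine engine wheel".split(),
    "car car engine engine road".split(),
    "flower flower petal garden garden".split(),
    "flower flower garden garden car".split(),
]

vocabulary = filter_terms_by_document_frequency(docs, min_tf=2, min_df=2)
matrix = term_document_matrix(docs, vocabulary)

for term, row in zip(vocabulary, matrix):
    print(term, row)
```

In this toy corpus the rare terms (`wheel`, `road`, `petal`) never reach the minimum word frequency in any document, so they are dropped and the resulting matrix has fewer, denser rows, which is the sparsity reduction the abstract refers to.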
