Computer Engineering and Applications ›› 2011, Vol. 47 ›› Issue (13): 150-153.

• Database, Signal and Information Processing •

Research on text categorization based on LDA

YAO Quanzhu, SONG Zhili, PENG Cheng

  1. School of Computer Science & Engineering, Xi’an University of Technology, Xi’an 710048, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2011-05-01 Published: 2011-05-01

基于LDA模型的文本分类研究

姚全珠,宋志理,彭 程   

  1. 西安理工大学 计算机科学与工程学院,西安 710048

Abstract: Traditional dimension reduction algorithms show their limitations when text corpora are high-dimensional and large-scale. This paper presents a Chinese text categorization algorithm based on Latent Dirichlet Allocation (LDA). Within the discriminative framework of the Support Vector Machine (SVM), LDA provides a generative probabilistic model of the text corpus that reduces each document to a fixed set of real-valued features: its probability distribution over a set of latent topics. Parameters are estimated by Gibbs sampling. Modeling the corpus yields a latent topic-document matrix, on which the SVM is trained. A standard Bayesian method is used to determine the optimal number of topics. In experiments comparing the proposed method with SVM on a Vector Space Model (VSM) text representation and with SVM on Latent Semantic Indexing (LSI) features, the results show that the proposed method is practical and effective for text categorization.

Key words: text categorization, Latent Dirichlet Allocation (LDA), Gibbs sampling, Bayesian statistical theory
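
The feature-extraction step described in the abstract (LDA fitted by Gibbs sampling, each document reduced to a distribution over latent topics) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `gibbs_lda`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy English corpus are all assumptions made for the example.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.5, beta=0.1, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns theta, one row per document: a probability distribution over topics.
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_k = [0] * num_topics                                # tokens assigned to each topic
    z = []                                                # topic assignment of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Smoothed document-topic distributions: the fixed-length feature vectors.
    return [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha) for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


# Hypothetical toy corpus, used only to exercise the sampler.
corpus = [
    "stock market price trade stock".split(),
    "market trade bank price".split(),
    "match goal team player goal".split(),
    "team player coach match".split(),
]
vocab = {w: i for i, w in enumerate(sorted({w for doc in corpus for w in doc}))}
docs = [[vocab[w] for w in doc] for doc in corpus]

theta = gibbs_lda(docs, num_topics=2, vocab_size=len(vocab))
for row in theta:
    print([round(p, 3) for p in row])
```

Each row of `theta` has length equal to the number of topics, however large the vocabulary is, so the rows can be passed as feature vectors to any SVM implementation. The abstract does not give the SVM kernel or parameters, so that step is left out here.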

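Abstract (Chinese version): To address the limitations of traditional dimension reduction algorithms on high-dimensional, large-scale text classification, a text classification algorithm based on the LDA model is proposed. Within the framework of the discriminative SVM model, the LDA probabilistic generative model is applied to build a topic model of the document collection, and an SVM is trained on the collection's latent topic-document matrix to construct the text classifier. Parameters are inferred by Gibbs sampling, and each text is represented as a probability distribution over a fixed set of latent topics. The optimal number of topics T is determined with a standard method from Bayesian statistical theory. Classification experiments on a corpus show that the method achieves better classification performance than SVM combined with a VSM text representation and SVM combined with LSI.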

Keywords (Chinese version): text classification, Latent Dirichlet Allocation (LDA) model, Gibbs sampling, Bayesian statistical theory
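
The abstract says only that a standard Bayesian method selects the number of topics T and does not name the criterion. One approach commonly paired with collapsed Gibbs sampling for LDA, shown here as an assumption rather than as the authors' stated procedure, is to compare the marginal likelihood of the corpus for different T, estimated from the Gibbs samples by a harmonic mean. In the notation assumed below, W is the vocabulary size, β is the symmetric Dirichlet prior on the topic-word distributions, n_j^{(w)} is the number of times word w is assigned to topic j, n_j^{(·)} is the total number of tokens assigned to topic j, and z^{(s)} is the s-th of S sampled topic assignments.

```latex
% Likelihood of the words given one sampled topic assignment z
P(\mathbf{w} \mid \mathbf{z}, T)
  = \left( \frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}} \right)^{T}
    \prod_{j=1}^{T}
    \frac{\prod_{w} \Gamma\!\left(n_{j}^{(w)} + \beta\right)}
         {\Gamma\!\left(n_{j}^{(\cdot)} + W\beta\right)}

% Harmonic-mean estimate of the marginal likelihood over S Gibbs samples
P(\mathbf{w} \mid T)
  \approx \left( \frac{1}{S} \sum_{s=1}^{S}
      \frac{1}{P(\mathbf{w} \mid \mathbf{z}^{(s)}, T)} \right)^{-1}
```

Under this criterion, the value of T with the largest log P(w | T) would be taken as the number of topics.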