Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (5): 127-133.

Research on text categorization based on mRMR and LDA

SHI Qingwei, CONG Shiyuan   

  1. College of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
  • Online: 2016-03-01  Published: 2016-03-17

Research on text classification based on mRMR and the LDA topic model

史庆伟,从世源   

  1. 辽宁工程技术大学 软件学院,辽宁 葫芦岛 125105

Abstract: Latent Dirichlet Allocation (LDA) does not take the input space into account effectively: it assigns a topic label to every word in the original input space and therefore retains non-action words (words that contribute nothing to the task), which strongly affects the topic probability distribution. To overcome this shortcoming, this paper proposes a new algorithm, mRMR_LDA. The mRMR feature selection step first maps the input space to a lower-dimensional space and filters out the non-action words, so that LDA assigns topic labels in a simpler, clearer space and obtains a more precise topic distribution. With the proposed algorithm, classification accuracy on the 20 Newsgroups corpus and the Fudan University corpus improves by 1.53% and 1.18%, respectively. The experimental results show that the mRMR_LDA model performs well in text classification.

Key words: Latent Dirichlet Allocation (LDA), minimum Redundancy Maximum Relevance (mRMR), text categorization
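
The filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes binary word-presence features, mutual information as both the relevance and the redundancy measure, and the common "relevance minus mean redundancy" greedy criterion. The function names (`mutual_information`, `mrmr_select`, `filter_docs`) and the toy corpus are hypothetical. The LDA and classification stages are not shown; the reduced documents would simply be passed to whichever LDA implementation is in use.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # Sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over the observed pairs.
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def mrmr_select(docs, labels, k):
    """Greedily pick up to k words by relevance minus mean redundancy."""
    vocab = sorted({w for doc in docs for w in doc})
    # Assumption: each word is a binary feature (present or absent in a document).
    presence = {w: [int(w in doc) for doc in docs] for w in vocab}
    # Relevance: mutual information between a word's presence and the class label.
    relevance = {w: mutual_information(presence[w], labels) for w in vocab}

    selected = []
    candidates = list(vocab)
    while candidates and len(selected) < k:
        def score(w):
            if not selected:
                return relevance[w]
            # Redundancy: mean mutual information with the words already chosen.
            redundancy = sum(mutual_information(presence[w], presence[s])
                             for s in selected) / len(selected)
            return relevance[w] - redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def filter_docs(docs, keep):
    """Drop every word that was not selected; the result is the input to LDA."""
    keep = set(keep)
    return [[w for w in doc if w in keep] for doc in docs]


# Toy corpus (hypothetical): tokenized documents with one class label each.
docs = [
    ["the", "goal", "match"], ["the", "goal", "match"],
    ["the", "disk"], ["the", "disk"],
    ["the", "vote"], ["the", "vote"],
]
labels = ["sport", "sport", "tech", "tech", "politics", "politics"]

selected = mrmr_select(docs, labels, k=3)
reduced_docs = filter_docs(docs, selected)
print(selected)
print(reduced_docs)
```

In this toy corpus, "the" occurs in every document, so its mutual information with the label is zero and it is never selected. "goal" and "match" always co-occur, so once one of them is chosen the other is penalized as redundant. A real pipeline would need a more careful discretization of term frequencies and a tuned value of `k`.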

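Abstract (translated from the Chinese): LDA does not take the input into account: it assigns a topic label to every word in the original input space and, because it retains non-action words, the topic probability distribution is affected. To address this, an mRMR_LDA algorithm is proposed. The mRMR feature selection algorithm is applied beforehand to map the input space to a low-dimensional space and filter out the non-action words, so that LDA can assign topic labels in a more concise and clearer space and obtain a more precise topic distribution. In classification experiments on the 20 Newsgroups corpus and the Fudan University corpus, classification accuracy improved by 1.53% and 1.18%, respectively. The experimental results show that the proposed mRMR_LDA model achieves good performance in text classification.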

Key words (translated from the Chinese): Latent Dirichlet Allocation, minimum Redundancy Maximum Relevance, text classification