Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (5): 127-133.
• Pattern Recognition and Artificial Intelligence •
史庆伟,从世源
SHI Qingwei, CONG Shiyuan
Abstract: Standard LDA takes no account of the quality of its input: it assigns a topic label to every word in the original input space, and the words it retains that contribute nothing ("non-action words") distort the topic probability distribution. To address this, the paper proposes the mRMR_LDA algorithm, which first applies mRMR feature selection to map the input space to a lower-dimensional space and filter out the non-action words, so that LDA performs topic labelling in a simpler, cleaner space and obtains a more accurate topic distribution. In classification experiments on the 20 Newsgroups corpus and the Fudan University corpus, classification accuracy improves by 1.53% and 1.18%, respectively. The experimental results indicate that the mRMR_LDA model performs well in text classification.
Key words: Latent Dirichlet Allocation (LDA), minimum Redundancy Maximum Relevance (mRMR), text categorization
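The filtering step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which mRMR criterion, feature representation, or number of selected words the authors use, so everything below is an assumption made for illustration: the difference form of mRMR (relevance minus mean redundancy), binary word-presence features, a six-document toy corpus, and the hypothetical helper names `mutual_information`, `mrmr_select`, and `filter_docs`. The LDA and classification stages are not shown.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Sum of p(x,y) * log(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def mrmr_select(docs, labels, k):
    """Greedy mRMR, difference form (an assumed variant, not taken from the paper).

    At each step, pick the word maximising
    I(word; label) - mean over already selected words s of I(word; s).
    """
    vocab = sorted({w for d in docs for w in d})
    # Binary presence vector of each word across the documents.
    presence = {w: [int(w in d) for d in docs] for w in vocab}
    relevance = {w: mutual_information(presence[w], labels) for w in vocab}
    redundancy_sum = {w: 0.0 for w in vocab}
    selected = []
    while len(selected) < min(k, len(vocab)):
        def score(w):
            mean_red = redundancy_sum[w] / len(selected) if selected else 0.0
            return relevance[w] - mean_red
        best = max((w for w in vocab if w not in selected), key=score)
        selected.append(best)
        # Update running redundancy totals against the newly selected word.
        for w in vocab:
            redundancy_sum[w] += mutual_information(presence[w], presence[best])
    return selected


def filter_docs(docs, keep):
    """Drop every token that is not among the selected words."""
    keep = set(keep)
    return [[w for w in d if w in keep] for d in docs]


# Toy corpus (hypothetical): "the" occurs everywhere and carries no class signal.
docs = [
    ["the", "game", "team"], ["the", "game", "win"],
    ["the", "cpu", "code"], ["the", "cpu", "disk"],
    ["the", "rice", "soup"], ["the", "rice", "salt"],
]
labels = ["sport", "sport", "tech", "tech", "food", "food"]

selected = mrmr_select(docs, labels, k=3)
filtered = filter_docs(docs, selected)
print(sorted(selected))
print(filtered)
```

On this toy corpus the three class-specific words are kept and "the" is dropped, since a word present in every document has zero mutual information with the labels. In a pipeline of the kind the abstract describes, the filtered documents would then be passed to an LDA implementation in place of the raw ones.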
SHI Qingwei, CONG Shiyuan. Research on text categorization based on mRMR and LDA[J]. Computer Engineering and Applications, 2016, 52(5): 127-133.
Link to this article: http://cea.ceaj.org/CN/
http://cea.ceaj.org/CN/Y2016/V52/I5/127