Computer Engineering and Applications, 2015, Vol. 51, Issue (7): 131-135.

• Databases, Data Mining, Machine Learning •

Conditional random fields topic model based on LDA model

SHI Qingwei (史庆伟), GUO Pengliang (郭朋亮)

  1. College of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
  • Online: 2015-04-01 Published: 2015-03-31

Abstract: Using topic models to model text and extract its latent topics, and then applying those topics to tasks such as part-of-speech tagging and document classification, is an active research area in machine learning and text mining. This paper proposes an LDA-based topic model built on a "bag of paragraphs" (段袋) assumption: each paragraph in a text carries a single topic, and consecutive paragraphs tend to share the same topic. A conditional random field (CRF) model is used to segment the paragraphs of a document and to judge whether they share the same topic. Experiments show that, compared with the LDA model, the new model extracts topics better and achieves lower perplexity, and that it also performs well on part-of-speech tagging and document classification.

Key words: Latent Dirichlet Allocation (LDA), conditional random fields, topic
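
For readers unfamiliar with the metric, the perplexity mentioned in the abstract is conventionally defined for topic models as sketched below. This is the standard held-out definition and is an assumption here: the abstract does not state the exact formula the authors used. The symbols $D_{\text{test}}$, $M$, $\mathbf{w}_d$ and $N_d$ are notation introduced only for this illustration.

```latex
% Assumed standard definition of held-out perplexity for a topic model.
% D_test : held-out document collection (notation introduced here)
% M      : number of documents in D_test
% w_d    : the words of document d
% N_d    : the number of words in document d
\[
\operatorname{perplexity}(D_{\text{test}})
  = \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}
                        {\sum_{d=1}^{M} N_d} \right)
\]
```

Under this definition, a lower value means the model assigns higher probability to unseen text, which is the sense in which the abstract compares the new model with LDA.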