Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (6): 61-66.

• Big Data and Cloud Computing •

A microblog topic detection method combining mutual information and topic models

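SUN Yuexin, MA Huifang, YAO Wei, ZHANG Zhichang

  1. College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070
  • Publication date: 2016-03-15 Release date: 2016-03-17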

Microblog hot topic detection based on positive pointwise mutual information and probabilistic topic model

SUN Yuexin, MA Huifang, YAO Wei, ZHANG Zhichang   

  1. College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China
  • Online: 2016-03-15 Published: 2016-03-17

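Abstract: To address the challenge that the feature sparsity of short-text streams poses for hot topic detection, a microblog hot topic detection method combining word mutual information with a probabilistic topic model is proposed. A word co-occurrence matrix is built and a symmetric nonnegative matrix factorization algorithm is applied to it to obtain the term-topic matrix; a probabilistic latent semantic analysis model is then used for topic discovery; finally, a microblog hotness measure is defined and used for analysis and ranking, which effectively supports microblog hot topic detection. Experiments show that the method clusters topics effectively and detects hot topics.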

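Key words: word co-occurrence matrix, symmetric nonnegative matrix factorization, probabilistic latent semantic analysis, microblog hot topic detection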

Abstract: To address the challenge that the feature sparsity of short text messages poses for microblog hot topic detection, this paper proposes a hot topic detection method that combines term mutual information with a probabilistic topic model. Symmetric Nonnegative Matrix Factorization (sNMF) is performed on a word co-occurrence matrix weighted by word mutual information, and the term-topic matrix is inferred from it. A Probabilistic Latent Semantic Analysis (pLSA) model is then adopted to model the topic-microblog relationship. Finally, the hotness of each topic is analyzed and the topics are ranked. Experiments show that this method can effectively cluster microblogs by topic and detect hot topics.

Key words: term co-occurrence matrix, symmetric nonnegative matrix factorization, probabilistic latent semantic analysis, microblog hot topic detection
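
The first step named in the abstract, a word co-occurrence matrix weighted by mutual information, can be illustrated with a minimal sketch. The abstract does not give the exact weighting, so this assumes the standard positive pointwise mutual information (PPMI) with co-occurrence counted once per post; the function name `ppmi_matrix` and the toy posts are hypothetical and are not taken from the paper. The result is a symmetric, nonnegative matrix, which is the kind of input a symmetric nonnegative matrix factorization expects.

```python
import math
from collections import Counter
from itertools import combinations


def ppmi_matrix(docs):
    """Return (vocab, matrix), where matrix[i][j] is the PPMI of vocab[i] and vocab[j].

    Assumption: two words co-occur if they appear in the same post, and each
    post contributes at most one count per word and per word pair.
    """
    n_docs = len(docs)
    word_df = Counter()  # number of posts containing each word
    pair_df = Counter()  # number of posts containing each unordered word pair
    for tokens in docs:
        unique = sorted(set(tokens))
        word_df.update(unique)
        pair_df.update(combinations(unique, 2))

    vocab = sorted(word_df)
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)
    matrix = [[0.0] * size for _ in range(size)]
    for (w1, w2), joint in pair_df.items():
        # PMI = log(P(w1, w2) / (P(w1) * P(w2))), probabilities estimated per post.
        pmi = math.log(joint * n_docs / (word_df[w1] * word_df[w2]))
        value = max(pmi, 0.0)  # the "positive" part: clip negative associations to zero
        i, j = index[w1], index[w2]
        matrix[i][j] = matrix[j][i] = value  # fill both halves to keep it symmetric
    return vocab, matrix


# Toy tokenized posts (made-up data for illustration only).
posts = [
    ["earthquake", "rescue", "sichuan"],
    ["earthquake", "rescue"],
    ["concert", "ticket"],
    ["concert", "ticket", "rescue"],
]
vocab, ppmi = ppmi_matrix(posts)
```

In this toy data, "concert" and "ticket" always appear together and receive a positive score, while "concert" and "rescue" co-occur less often than chance would predict and are clipped to zero. Clipping is what keeps the matrix nonnegative.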