计算机工程与应用 ›› 2014, Vol. 50 ›› Issue (1): 96-100.

• 数据库、数据挖掘、机器学习 •

基于隐含语义分析的微博话题发现方法

马雯雯1,魏文晗1,邓一贵1,2   

  1. 重庆大学 计算机学院,重庆 400044
  2. 重庆大学 信息与网络管理中心,重庆 400044
  • 出版日期:2014-01-01 发布日期:2013-12-30

Micro-blog topic detection method based on Latent Semantic Analysis

MA Wenwen1, WEI Wenhan1, DENG Yigui1,2

  1. School of Computer Science, Chongqing University, Chongqing 400044, China
  2. Center of Information and Network, Chongqing University, Chongqing 400044, China
  • Online: 2014-01-01 Published: 2013-12-30

摘要: 随着微博的大量普及和关注度的不断提高,微博热点话题发现已成为当前研究热点。针对于短文本、向量空间模型(VSM)文本表示方法存在高维度、稀疏,以及同义多义问题,导致难以准确度量文本相似度,提出一种基于隐含语义分析的两阶段聚类话题发现方法。引入话题热度的概念来选取具有一定关注度的微博文本,用隐含语义分析(LSA)对数据集进行建模;用层次聚类的CURE算法确定初始类中心;用K-means聚类得到热点话题的聚类结果。真实微博数据集的实验结果验证了该方法的有效性。

关键词: 隐含语义分析, 向量空间模型, 话题发现, 微博, 两阶段聚类

Abstract: With the growing popularity of micro-blogs and the rising attention they receive, hot topic detection on micro-blogs has become an active research focus. For short texts, the Vector Space Model (VSM) text representation suffers from high dimensionality, sparsity, and synonymy and polysemy, which make it difficult to measure text similarity accurately. This paper proposes a two-stage clustering approach to topic detection based on Latent Semantic Analysis (LSA). First, the concept of topic heat is introduced to select micro-blog texts that have attracted a certain level of attention, and LSA is used to model the dataset. Then the hierarchical clustering algorithm CURE is used to determine the initial cluster centers. Finally, K-means clustering yields the hot topic clusters. Experimental results on a real micro-blog dataset verify the effectiveness of the method.

Key words: Latent Semantic Analysis (LSA), Vector Space Model (VSM), topic detection, micro-blog, two-stage clustering
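
The pipeline outlined in the abstract (LSA modelling, hierarchical clustering to pick initial centers, then K-means) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not define topic heat, so the selection step is omitted, and a plain centroid-linkage agglomerative merge stands in for CURE, which additionally keeps several shrunken representative points per cluster. The function names (`term_doc_matrix`, `lsa_embed`, `agglomerative_centers`, `kmeans`, `detect_topics`) and the toy documents are hypothetical, chosen only for illustration.

```python
import math
import random
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def term_doc_matrix(docs):
    """VSM representation: one row per document, one column per term (raw counts)."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: j for j, w in enumerate(vocab)}
    rows = []
    for d in docs:
        row = [0.0] * len(vocab)
        for w, c in Counter(d.split()).items():
            row[index[w]] = float(c)
        rows.append(row)
    return rows

def lsa_embed(X, k, iters=200, seed=0):
    """Rank-k LSA document coordinates (U * S), obtained by power iteration
    with deflation on the Gram matrix X X^T, whose eigenpairs are (U, S^2)."""
    n = len(X)
    G = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[] for _ in range(n)]
    for _ in range(k):
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):
            w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(G[i][j] * v[j] for j in range(n)) for i in range(n))
        for i in range(n):
            coords[i].append(v[i] * math.sqrt(max(lam, 0.0)))
        # Deflate so the next pass converges to the next eigenvector.
        G = [[G[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def agglomerative_centers(points, k):
    """Stage 1 (simplified stand-in for CURE): repeatedly merge the two clusters
    with the closest centroids until k clusters remain; return their centroids."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        cents = [centroid(c) for c in clusters]
        _, i, j = min((dist2(cents[a], cents[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return [centroid(c) for c in clusters]

def kmeans(points, init_centers, iters=20):
    """Stage 2: standard K-means, started from the stage-1 centers."""
    centers = [list(c) for c in init_centers]
    labels = []
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = centroid(members)
    return labels

def detect_topics(docs, n_topics):
    Z = lsa_embed(term_doc_matrix(docs), n_topics)
    return kmeans(Z, agglomerative_centers(Z, n_topics))

# Toy corpus with two topics (made-up data, whitespace-tokenised).
docs = [
    "apple banana fruit", "apple fruit juice", "banana fruit juice",
    "goal match team", "goal team score", "match team score",
]
labels = detect_topics(docs, 2)
print(labels)
```

Seeding K-means from a hierarchical pass removes its dependence on random initial centers, which is the motivation for a two-stage design. In practice the LSA step would use a library SVD and a weighted (for example TF-IDF) term matrix rather than the raw counts and power iteration used here to keep the sketch dependency-free.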