Computer Engineering and Applications, 2014, Vol. 50, Issue (22): 32-37.

• Doctoral Forum •

一种面向大规模微博数据的话题挖掘方法

王文帅1,2,杜  然1,2,程耀东1,陈  刚1   

  1. 中国科学院 高能物理研究所 计算中心,北京 100049
  2. 中国科学院大学,北京 100049
  • 出版日期:2014-11-15 发布日期:2014-11-13

Topic mining method on massive microblog data

WANG Wenshuai1,2, DU Ran1,2, CHENG Yaodong1, CHEN Gang1   

  1. Computing Center, Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Online:2014-11-15 Published:2014-11-13

摘要: 随着微博的日趋流行,新浪微博已成为公众获取和传播信息的重要平台之一,针对微博数据的话题挖掘也成为当前的研究热点。提出一个面向大规模微博数据的话题挖掘方法。首先对大规模微博数据进行分析,基于Bloom Filter算法对数据进行去重处理,针对微博的特有结构,对文本进行预处理,提出改进的LDA主题模型Social Network LDA(SNLDA),采用吉布斯采样法进行模型推导,挖掘出微博话题。实验结果表明,方法能有效地从大规模微博数据中挖掘出话题信息。

关键词: 微博, Bloom Filter, 社会网络主题模型分析(SNLDA), 话题挖掘

Abstract: As microblogging grows in popularity, Sina Weibo has become one of the major platforms through which the public obtains and spreads information, and topic mining on microblog data has become an active research area. This paper proposes a topic mining method for large-scale microblog data. The data are first analyzed and deduplicated with a Bloom Filter, and the text is then preprocessed to account for the particular structure of microblog posts. An improved LDA topic model, Social Network LDA (SNLDA), is proposed, and Gibbs sampling is used for model inference to extract the microblog topics. Experimental results show that the method can effectively mine topic information from large-scale microblog data.

Key words: microblog, Bloom Filter, Social Network LDA (SNLDA), topic mining
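
The abstract names Bloom Filter deduplication as the first step of the pipeline but gives no implementation details. The following is a minimal sketch of that general technique, not the authors' implementation. It assumes that a duplicate is a post whose text is identical to one seen earlier; the names `BloomFilter`, `deduplicate`, `num_bits`, and `num_hashes`, and the use of SHA-256 with double hashing, are illustrative choices that do not come from the paper.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # bit array, all zeros

    def _positions(self, item: str):
        # Derive num_hashes bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the stride is non-zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def deduplicate(posts, num_bits=1 << 20, num_hashes=7):
    """Yield each post whose text has (probably) not been seen before.

    A false positive drops a post that was actually new; a repeated post
    is always detected, so no exact duplicate passes through.
    """
    seen = BloomFilter(num_bits, num_hashes)
    for text in posts:
        if text in seen:
            continue
        seen.add(text)
        yield text
```

Calling `list(deduplicate(posts))` on an iterable of post texts returns the posts in their original order with repeated texts removed. The memory cost is fixed by `num_bits` regardless of how many posts are streamed through, which is the usual reason for choosing a Bloom filter over an exact set at this scale; the trade-off is a small, tunable probability of discarding a post that was in fact new.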