Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (17): 150-154.

• Databases, Data Mining, Machine Learning •

News Topic Identification in Chinese Microblogs Based on a Semantic Co-occurrence Graph

王路路,郑  涛,程倩倩,姬东鸿   

  1. School of Computer Science, Wuhan University, Wuhan 430072
  • Publication date: 2014-09-01 Release date: 2014-09-12

Discovering news topics from microblogs based on semantic co-occurrence

WANG Lulu, ZHENG Tao, CHENG Qianqian, JI Donghong   

  1. School of Computer, Wuhan University, Wuhan 430072, China
  • Online: 2014-09-01 Published: 2014-09-12

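Abstract: This paper proposes a method for automatically discovering news topics in large-scale collections of short microblog texts. After preprocessing the microblog data, the method extracts topic words by combining several factors: TF-IDF, the growth rate of document frequency, and named entity recognition. A semantic co-occurrence graph of the topic words is then built from the semantic relations among them, the connected subgraphs of this graph are computed, and each mutually disconnected cluster is treated as one news topic. Experiments on a Sina Weibo dataset show that the method identifies news topics in microblogs. It detects currently trending topics well and can, to some extent, avoid error propagation; the experimental results verify its effectiveness.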
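The abstract names three signals for topic-word extraction but does not say how they are combined. The following is a minimal Python sketch under stated assumptions: the linear combination, the weights `w_tfidf`, `w_growth` and `w_entity`, the smoothing terms, and the function name `score_candidates` are all hypothetical, and the named entities are assumed to come from an external NER tool. It is an illustration of the idea, not the authors' formula.

```python
import math
from collections import Counter


def score_candidates(current_posts, previous_posts, named_entities,
                     w_tfidf=1.0, w_growth=1.0, w_entity=0.5):
    """Score candidate words in the current time window.

    current_posts / previous_posts: lists of token lists for two
        consecutive time windows (already segmented and cleaned).
    named_entities: set of words flagged by some external NER tool.
    The weights and the linear combination are illustrative assumptions.
    """
    n_cur = len(current_posts)
    # Raw term counts and per-window document frequencies.
    tf = Counter(t for post in current_posts for t in post)
    df_cur = Counter(t for post in current_posts for t in set(post))
    df_prev = Counter(t for post in previous_posts for t in set(post))
    total_terms = sum(tf.values())

    scores = {}
    for word, freq in tf.items():
        # Smoothed TF-IDF over the current window.
        idf = math.log((1 + n_cur) / (1 + df_cur[word])) + 1.0
        tfidf = (freq / total_terms) * idf
        # Relative growth of document frequency between windows
        # (+1 in the denominator avoids division by zero).
        growth = (df_cur[word] - df_prev[word]) / (df_prev[word] + 1)
        # Flat bonus for words recognised as named entities.
        entity = 1.0 if word in named_entities else 0.0
        scores[word] = w_tfidf * tfidf + w_growth * growth + w_entity * entity
    return scores
```

Under these assumptions, a word whose document frequency jumps between windows outranks a word that is merely common in both, and a named entity outranks an otherwise identical ordinary word. The highest-scoring words would then serve as the topic words.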

Keywords: microblog, topic words, semantic co-occurrence graph, news topic identification

Abstract: A method for detecting news topics in large-scale collections of short microblog posts is proposed. After preprocessing, TF-IDF, the document frequency growth rate, and named entity recognition are combined to extract topic keywords from the microblogs. A semantic co-occurrence graph is built from the co-occurrence degrees of the keywords, and each disconnected cluster in the graph is taken as a news topic. Experiments are conducted on Sina microblog data sets, and the experimental results show that the proposed method is effective.
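The graph step described above can be sketched as follows. This is a simplified Python illustration, not the paper's implementation: it uses a raw co-occurrence count as the edge criterion in place of the paper's semantic co-occurrence degree, and the function name `cooccurrence_topics` and the threshold `min_cooccur` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_topics(posts, topic_words, min_cooccur=2):
    """Group topic words into clusters via connected components.

    posts: list of token lists (one per microblog post).
    topic_words: set of already-extracted topic words.
    min_cooccur: hypothetical threshold; an edge is added only when two
        words appear together in at least this many posts.
    """
    # Count, for every pair of topic words, the posts containing both.
    pair_counts = defaultdict(int)
    for tokens in posts:
        present = sorted(set(tokens) & topic_words)
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1

    # Undirected adjacency list, keeping only sufficiently strong edges.
    graph = {w: set() for w in topic_words}
    for (a, b), n in pair_counts.items():
        if n >= min_cooccur:
            graph[a].add(b)
            graph[b].add(a)

    # Depth-first search to collect connected components.
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(component)
    return clusters
```

Each returned set corresponds to one candidate news topic. Words with no edge above the threshold come back as singleton clusters, which a caller may want to filter out.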

Key words: microblog, keywords, semantic co-occurrence graph, news topic detection