Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (8): 81-86. DOI: 10.3778/j.issn.1002-8331.1511-0156

• Big Data and Cloud Computing •

Microblog topic detection method based on clustering ensemble

FENG Xupeng¹, MA Zhen¹, XIE Bo¹, LIU Lijun², HUANG Qingsong²

  1. Educational Technology and Campus Network Center, Kunming University of Science and Technology, Kunming 650500, China
  2. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
  • Online: 2017-04-15  Published: 2017-04-28

Abstract: Microblog posts are short, use informal language, and contain a large amount of noise, so traditional topic detection methods struggle to discover new topics in them. These methods also ignore the time at which a post was published. To address these characteristics and the dynamic nature of topics, this paper proposes a microblog topic detection method based on clustering ensemble. The method takes into account a nonlinear time factor for post publication and uses an improved K-Means method to build one base clusterer for each microblog feature. It then evaluates the effectiveness of the base clusterers and the diversity among them, and uses these evaluations to set the voting weights for the final clustering ensemble. Comparative experiments show that the method improves the accuracy of microblog topic detection by about 9.5% and detects new topics more effectively.

Key words: short text, noise, topic detection, dynamic, nonlinear time factor, base clusterer, clustering ensemble
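
The abstract names a weighted-voting ensemble over base clusterers but gives no formulas, so the following is only a minimal sketch of how such a step could look, not the authors' method. Everything in it is an assumption made for illustration: the function names `rand_index`, `voting_weights`, and `consensus`, the use of the Rand index to measure diversity, the rule `validity * (1 + diversity)` for the weights, the co-association vote with a 0.5 threshold, and the toy labelings. The base clusterings are taken as given label lists; the improved K-Means step and the nonlinear time factor are not modelled.

```python
from itertools import combinations


def rand_index(a, b):
    """Fraction of item pairs that two labelings treat the same way."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def voting_weights(labelings, validity):
    """Hypothetical rule: weight = validity * (1 + diversity), normalised to sum to 1."""
    raw = []
    for k, labels in enumerate(labelings):
        others = [rand_index(labels, o) for m, o in enumerate(labelings) if m != k]
        diversity = 1.0 - sum(others) / len(others)  # low agreement = high diversity
        raw.append(validity[k] * (1.0 + diversity))
    total = sum(raw)
    return [r / total for r in raw]


def consensus(labelings, weights, threshold=0.5):
    """Link two posts when the weighted vote for 'same topic' exceeds the threshold."""
    n = len(labelings[0])
    parent = list(range(n))  # union-find forest over post indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        vote = sum(w for w, lab in zip(weights, labelings) if lab[i] == lab[j])
        if vote > threshold:
            parent[find(i)] = find(j)

    roots = {}  # relabel components as 0, 1, 2, ... in order of first appearance
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Toy input: three base clusterings of six posts, one per (assumed) feature.
base = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [0, 0, 0, 0, 1, 1],
]
validity = [0.8, 0.5, 0.6]  # assumed per-clusterer quality scores

weights = voting_weights(base, validity)
topics = consensus(base, weights)
print(weights, topics)
```

Because linking is transitive, a single strongly supported pair can merge two groups, so the threshold controls how coarse the final topics are. A real system would replace the toy labelings with the output of the base clusterers and would likely use a more careful validity measure.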