Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (17): 73-78.

• Big Data and Cloud Computing •

Method of micro-blog event tracking based on word vector

ZHANG Jiaming, XI Yaoyi, WANG Bo, TANG Haohao, LI Tiancai

  1. Institute of Information and System Engineering, PLA Information Engineering University, Zhengzhou 450001, China
  • Online: 2016-09-01 Published: 2016-09-14

Abstract: Traditional methods do not perform well enough in micro-blog event tracking, because micro-blog texts are short and new Internet words emerge constantly. To address this problem, a micro-blog event tracking method based on word vectors is proposed. Word vectors make it possible to compute the semantic similarity between words, and they also improve the accuracy of semantic similarity computation between micro-blogs. First, a Skip-gram model is trained on a large-scale dataset to obtain the word vectors. Then, representation models of the initial event and of each micro-blog are built by extracting keywords. Finally, the semantic similarity between each micro-blog and the initial event is computed from the word vectors, and a decision against a preset threshold completes the event tracking. Experimental results show that, compared with traditional methods, the proposed method makes full use of the semantic information introduced by word vectors and effectively improves the performance of micro-blog event tracking.

Key words: micro-blog, event tracking, short text, Skip-gram model, word vector, semantic information
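
The final step described in the abstract (scoring a micro-blog against the initial event with word vectors, then applying a threshold) can be illustrated with a minimal sketch. The abstract does not give the similarity formula, so the aggregation used here (for each micro-blog keyword, take its best cosine match among the event keywords, then average) is an assumption, as are the function names, the toy 3-dimensional vectors, and the threshold value 0.6. In practice the vectors would come from a trained Skip-gram model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def post_event_similarity(post_keywords, event_keywords, vectors):
    """Assumed aggregation: average, over the micro-blog keywords, of the best
    cosine match among the event keywords. Keywords without a vector are skipped."""
    post = [w for w in post_keywords if w in vectors]
    event = [w for w in event_keywords if w in vectors]
    if not post or not event:
        return 0.0
    best = [max(cosine(vectors[p], vectors[e]) for e in event) for p in post]
    return sum(best) / len(best)

def is_related(post_keywords, event_keywords, vectors, threshold=0.6):
    """Threshold decision; the value 0.6 is a hypothetical placeholder."""
    return post_event_similarity(post_keywords, event_keywords, vectors) >= threshold

# Toy vectors standing in for Skip-gram output (hypothetical values).
TOY_VECTORS = {
    "earthquake": [0.9, 0.1, 0.0],
    "quake": [0.8, 0.2, 0.0],
    "rescue": [0.7, 0.3, 0.1],
    "concert": [0.0, 0.1, 0.9],
    "ticket": [0.1, 0.0, 0.8],
}

event = ["earthquake", "rescue"]
print(is_related(["quake", "rescue"], event, TOY_VECTORS))    # True
print(is_related(["concert", "ticket"], event, TOY_VECTORS))  # False
```

Here "quake" shares no surface form with "earthquake", so an exact keyword match would miss it, while the vector-based score still places the micro-blog close to the event. This is the kind of semantic information the abstract refers to.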