Computer Engineering and Applications, 2020, Vol. 56, Issue (14): 138-147. DOI: 10.3778/j.issn.1002-8331.1905-0031

• Pattern Recognition and Artificial Intelligence •

Fast Short Text Data Stream Classification Method Based on Spark

HU Yang, HU Xuegang, LI Peipei   

  1. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China
  2. Key Laboratory of Industrial Safety and Emergency Technology Anhui Province, Hefei 230009, China
  • Online: 2020-07-15 Published: 2020-07-14

Abstract:

Short text data streams emerging on social network platforms such as Weibo and Facebook are massive, high-dimensional, sparse, and rapidly changing, which makes their classification highly challenging. Existing short text data stream classification methods struggle to handle high-dimensional, sparse features effectively and incur high time costs when processing massive data streams. Motivated by this, this paper proposes a distributed, fast short text data stream classification method based on Spark. On the one hand, an external corpus is used to build a Word2vec word vector model that addresses the high dimensionality and sparsity of short texts, and an extended word vector library is built to adapt to rapid changes in the text. An ensemble model of LR classifiers is then proposed for classifying short text data streams. The classifier uses an FTRL method to update its model parameters online, and a time-factor weighting mechanism is introduced to adapt to concept drift. On the other hand, the proposed method uses distributed processing to improve the efficiency of handling massive short text data streams. Experiments on three real short text data streams show that the proposed method reduces time consumption while improving classification accuracy.

Key words: short text data stream classification, distributed processing, Spark, concept drift
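
The classifier described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: it assumes that LR denotes binary logistic regression, that FTRL refers to the per-coordinate FTRL-Proximal update, that each input is already a dense feature vector (for example, an average of Word2vec word vectors), and that the time factor is an exponential decay over chunk age. The names FTRLLogisticRegression, TimeWeightedEnsemble, decay, and max_members are hypothetical, and the Spark distribution layer is omitted.

```python
import math


def sigmoid(t):
    # Clamp the margin so math.exp cannot overflow.
    t = max(-35.0, min(35.0, t))
    return 1.0 / (1.0 + math.exp(-t))


class FTRLLogisticRegression:
    """Binary logistic regression trained online with FTRL-Proximal."""

    def __init__(self, dim, alpha=0.5, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated adjusted gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def weights(self):
        w = []
        for z_i, n_i in zip(self.z, self.n):
            if abs(z_i) <= self.l1:
                w.append(0.0)  # L1 threshold keeps this coordinate at zero
            else:
                sign = 1.0 if z_i > 0 else -1.0
                denom = (self.beta + math.sqrt(n_i)) / self.alpha + self.l2
                w.append(-(z_i - sign * self.l1) / denom)
        return w

    def predict_proba(self, x):
        return sigmoid(sum(w_i * x_i for w_i, x_i in zip(self.weights(), x)))

    def update(self, x, y):
        """One online step on feature vector x with label y in {0, 1}."""
        w = self.weights()
        p = sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)))
        for i, x_i in enumerate(x):
            g = (p - y) * x_i  # gradient of the log loss for coordinate i
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w[i]
            self.n[i] += g * g


class TimeWeightedEnsemble:
    """Keeps one model per recent chunk; older models get smaller votes.

    The weight exp(-decay * age) is an assumed form of the time factor.
    """

    def __init__(self, dim, max_members=5, decay=1.0):
        self.dim, self.max_members, self.decay = dim, max_members, decay
        self.members = []  # list of (chunk_index, model)
        self.chunk_index = 0

    def learn_chunk(self, chunk):
        model = FTRLLogisticRegression(self.dim)
        for x, y in chunk:
            model.update(x, y)
        self.members.append((self.chunk_index, model))
        self.members = self.members[-self.max_members:]  # drop the oldest
        self.chunk_index += 1

    def predict_proba(self, x):
        now = self.chunk_index - 1
        total = weighted = 0.0
        for born, model in self.members:
            vote = math.exp(-self.decay * (now - born))
            weighted += vote * model.predict_proba(x)
            total += vote
        return weighted / total if total else 0.5
```

In this sketch, each incoming chunk of labeled vectors is passed to learn_chunk, and predict_proba returns the age-weighted average of the member predictions, so models trained before a drift lose influence as newer chunks arrive.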