Fast Short Text Data Stream Classification Method Based on Spark

doi:10.3778/j.issn.1002-8331.1905-0031

Abstract

Short text data streams emerging on social network platforms such as Weibo and Facebook are massive in volume, high-dimensional, sparse, and fast-changing, which makes short text data stream classification a major challenge. Existing short text data stream classification methods struggle to handle high-dimensional, sparse features effectively and incur high time costs when processing massive data streams. Motivated by this, this paper proposes a distributed fast short text data stream classification method based on Spark. On the one hand, a Word2vec model is built from an external corpus to address the high dimensionality and sparsity of short texts, and an extended word vector library is constructed to adapt to the fast-changing nature of the texts. An ensemble model of LR classifiers is then proposed for classifying short text data streams; it uses an FTRL method to update model parameters online and introduces a time-factor weighting mechanism to adapt to concept drift. On the other hand, the proposed method uses distributed processing to improve efficiency when handling massive short text streams. Experiments on three real short text data streams show that the proposed method greatly reduces time consumption while improving classification accuracy.

Keywords: short text data stream classification, distributed processing, Spark, concept drift
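The online update and time-weighted ensemble named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it reads LR as binary logistic regression, uses the standard per-coordinate FTRL-Proximal update, and assumes an exponential decay over classifier age as the time factor, since the abstract does not give the actual weighting formula. The names `FTRLLogisticRegression`, `ensemble_proba`, and `decay` are hypothetical, and the input `x` stands in for a dense short-text vector (for example, averaged Word2vec embeddings). Spark distribution is omitted.

```python
import math


class FTRLLogisticRegression:
    """Binary logistic regression trained online with FTRL-Proximal.

    Hypothetical sketch: hyperparameter defaults are illustrative only.
    """

    def __init__(self, dim, alpha=0.1, beta=1.0, l1=0.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated adjusted gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def _weight(self, i):
        # Closed-form FTRL-Proximal weight; L1 zeroes out small coordinates.
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict_proba(self, x):
        s = sum(self._weight(i) * xi for i, xi in enumerate(x))
        s = max(min(s, 35.0), -35.0)  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, x, y):
        """Consume one labelled example (y in {0, 1}); return the prior prediction."""
        p = self.predict_proba(x)
        for i, xi in enumerate(x):
            w = self._weight(i)  # weight used for the prediction above
            g = (p - y) * xi     # gradient of the log loss
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w
            self.n[i] += g * g
        return p


def ensemble_proba(models, x, now, decay=0.5):
    """Time-weighted vote over (created_at, model) pairs.

    Assumption: weight = exp(-decay * age), so newer classifiers dominate
    after a concept drift. The paper's actual time factor may differ.
    """
    weights = [math.exp(-decay * (now - created_at)) for created_at, _ in models]
    total = sum(weights)
    return sum(w * m.predict_proba(x) for w, (_, m) in zip(weights, models)) / total
```

With `decay=0` the ensemble reduces to a plain average of its members; larger values shift the vote toward the most recently created classifiers.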

HU Yang, HU Xuegang, LI Peipei. Fast Short Text Data Stream Classification Method Based on Spark[J]. Computer Engineering and Applications, 2020, 56(14): 138-147.
