Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (2): 145-152. DOI: 10.3778/j.issn.1002-8331.2102-0223

• Pattern Recognition and Artificial Intelligence •

Research on Short Text Clustering Based on BERT and AutoEncoder

ZHU Liangqi, HUANG Bo, HUANG Jitao, MA Liyuan, SHI Zhicai   

  1. School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
  2. Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai 200240, China
  • Online: 2022-01-15 Published: 2022-01-18

Abstract: Short texts contain fewer words than long texts, which makes their semantic features harder to extract. Representing them with the traditional vector space model (VSM) tends to produce high-dimensional, sparse vectors. Such sparse word representations carry little semantic relatedness, and the resulting semantic gap leads to low accuracy and sensitivity to noise in downstream clustering. To address this, a new clustering model, BERT_AE_K-Means, is proposed. The pre-trained model BERT (bidirectional encoder representations from transformers) is used to initialize the text representations. An AutoEncoder is then self-trained on these representation vectors to extract higher-order features. Finally, the resulting feature extractor (the Encoder) and the K-Means clustering model are trained jointly, so that the feature extraction module and the clustering module are optimized at the same time, improving the accuracy and robustness of the clustering model. On four datasets, the proposed model achieves higher accuracy and normalized mutual information (NMI) than six baseline models, including Word2Vec_K-Means and STC2, and reaches an accuracy of 82.28% on the SearchSnippet dataset. The experimental results show that the proposed method effectively improves the accuracy of short text clustering.
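
The three stages named in the abstract (initial text vectors, AutoEncoder self-training, joint training of the Encoder with K-Means) can be illustrated with a small NumPy sketch. This is not the authors' implementation, and every detail below is an assumption made for illustration: random Gaussian blobs stand in for BERT sentence vectors, the AutoEncoder is a single linear layer, the inputs are standardized per feature, and the joint objective is reconstruction error plus `lam` times the squared distance of each encoded point to its assigned centroid. The abstract does not state the actual loss, architecture, or hyperparameters, and names such as `fit_ae_kmeans` and `make_toy_embeddings` are hypothetical.

```python
import numpy as np


def make_toy_embeddings(n_per_cluster=30, k=3, dim=16, seed=0):
    """Hypothetical stand-in for BERT sentence vectors: k Gaussian blobs."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(0.0, 3.0, size=(k, dim))
    noise = rng.normal(0.0, 0.5, size=(k, n_per_cluster, dim))
    return (centers[:, None, :] + noise).reshape(k * n_per_cluster, dim)


def assign(Z, C):
    """Index of the nearest centroid (squared Euclidean distance) per row."""
    d2 = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def update_centroids(Z, labels, C):
    """Mean of each cluster; an empty cluster keeps its previous centroid."""
    C = C.copy()
    for j in range(len(C)):
        members = Z[labels == j]
        if len(members) > 0:
            C[j] = members.mean(axis=0)
    return C


def fit_ae_kmeans(X, k, d_hid=8, pre_epochs=200, joint_epochs=50,
                  lr=0.01, lam=0.1, seed=0):
    """Sketch: pretrain a linear AutoEncoder, then train it jointly with K-Means."""
    rng = np.random.default_rng(seed)
    n, d_in = X.shape
    # Assumption: standardize features so plain gradient descent stays stable.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    We = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(d_in, d_hid))  # encoder
    Wd = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(d_hid, d_in))  # decoder

    # Stage 1: self-training, minimizing (1/2n) * ||X We Wd - X||^2.
    for _ in range(pre_epochs):
        Z = X @ We
        R = Z @ Wd - X                      # reconstruction residual
        grad_We = X.T @ (R @ Wd.T) / n
        grad_Wd = Z.T @ R / n
        We -= lr * grad_We
        Wd -= lr * grad_Wd

    # Initialize K-Means (Lloyd's algorithm) on the encoded features.
    Z = X @ We
    C = Z[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(20):
        labels = assign(Z, C)
        C = update_centroids(Z, labels, C)
    labels = assign(Z, C)

    # Stage 2: joint training. The extra term (lam/2n) * ||Z - C[labels]||^2
    # pulls each encoded point toward its centroid while the reconstruction
    # term keeps the encoder informative.
    for _ in range(joint_epochs):
        Z = X @ We
        R = Z @ Wd - X
        G = Z - C[labels]
        grad_We = X.T @ (R @ Wd.T + lam * G) / n
        grad_Wd = Z.T @ R / n
        We -= lr * grad_We
        Wd -= lr * grad_Wd
        Z = X @ We
        labels = assign(Z, C)               # refresh assignments
        C = update_centroids(Z, labels, C)  # refresh centroids
    return We, C, labels
```

Calling `fit_ae_kmeans(make_toy_embeddings(), k=3)` returns the encoder weights, the centroids in the encoded space, and one cluster label per input row. In a real pipeline the toy vectors would be replaced by embeddings from a pre-trained BERT model, and the linear layers by a deeper nonlinear AutoEncoder.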

Key words: short text clustering, AutoEncoder, natural language processing, bidirectional encoder representations from transformers (BERT)