Research on Short Text Clustering Based on BERT and AutoEncoder

doi:10.3778/j.issn.1002-8331.2102-0223

Abstract

Abstract: Short texts contain fewer words than long texts, which makes their semantic features harder to extract. Representing them with the traditional vector space model (VSM) tends to produce high-dimensional, sparse vectors. Such sparse word representations lack semantic relatedness and create a semantic gap, so downstream clustering suffers from low accuracy and sensitivity to noise. To address this, a new clustering model, BERT_AE_K-Means, is proposed. The pre-trained model BERT (bidirectional encoder representations from transformers) is used to initialize the text representations. An autoencoder (AutoEncoder) is then self-trained on these representation vectors to extract higher-order features. Finally, the resulting feature extractor (the Encoder) and the K-Means clustering model are trained jointly, so that the feature extraction module and the clustering module are optimized at the same time, improving the accuracy and robustness of the clustering model. On four datasets, the proposed model improves both accuracy and normalized mutual information (NMI) over six baseline models, including Word2Vec_K-Means and STC2, and reaches an accuracy of 82.28% on the SearchSnippet dataset. The experimental results show that the proposed method effectively improves the accuracy of short text clustering.

Key words: short text clustering, AutoEncoder, natural language processing, bidirectional encoder representations from transformers (BERT)
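
The abstract reports accuracy and normalized mutual information without defining them. As a rough illustration only, the sketch below computes both under the definitions conventionally used for clustering evaluation, which are an assumption here. Accuracy takes the best one-to-one mapping between cluster ids and class labels, found by brute force rather than the Hungarian algorithm. NMI divides the mutual information by the geometric mean of the two label entropies. The function names and the toy labels are hypothetical and do not come from the paper.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(labels_true, labels_pred):
    """Fraction of samples matched under the best one-to-one mapping
    from cluster ids to class labels (brute force; assumes few clusters)."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    # Pad with placeholder targets so every cluster gets a distinct target.
    targets = classes + [None] * max(0, len(clusters) - len(classes))
    pair_counts = Counter(zip(labels_pred, labels_true))
    best_hits = 0
    for assignment in permutations(targets, len(clusters)):
        hits = sum(pair_counts[(c, t)] for c, t in zip(clusters, assignment))
        best_hits = max(best_hits, hits)
    return best_hits / len(labels_true)


def normalized_mutual_information(labels_true, labels_pred):
    """NMI = I(T; P) / sqrt(H(T) * H(P)); the geometric-mean
    normalization is an assumption, other variants exist."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )
    h_true, h_pred = entropy(count_true), entropy(count_pred)
    if h_true == 0.0 or h_pred == 0.0:
        # Degenerate case: at least one labeling puts everything in one group.
        return 1.0 if h_true == h_pred else 0.0
    return mutual_info / math.sqrt(h_true * h_pred)


# Toy example: cluster ids are arbitrary, so cluster 1 corresponds to
# class 0 and cluster 0 to class 1; one sample of class 0 is misplaced.
toy_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
toy_pred = [1, 1, 0, 0, 0, 0, 2, 2, 2]
print("ACC:", round(clustering_accuracy(toy_true, toy_pred), 3))
print("NMI:", round(normalized_mutual_information(toy_true, toy_pred), 3))
```

Because cluster ids carry no meaning of their own, both measures are invariant to relabeling the clusters, which is why a mapping step (for accuracy) or an information-theoretic score (for NMI) is used instead of a direct label comparison.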

ZHU Liangqi, HUANG Bo, HUANG Jitao, MA Liyuan, SHI Zhicai. Research on Short Text Clustering Based on BERT and AutoEncoder[J]. Computer Engineering and Applications, 2022, 58(2): 145-152.
