Research on Embedded Text Topic Model Based on BERT

doi:10.3778/j.issn.1002-8331.2106-0048

Abstract

Topic models can mine semantically rich topic words from massive text data and play an important role in text analysis tasks. Because the traditional LDA topic model represents text with a bag-of-words model, it cannot model the semantic and sequential relationships between words, and it ignores stop words and low-frequency words. The embedded topic model (ETM) addresses these problems by representing words with Word2Vec vectors, but it usually assigns a polysemous word the same vector in every context and therefore cannot reflect contextual differences in meaning. To address this, a BERT-based embedded topic model, BERT-ETM, is designed for topic mining, and its effectiveness is verified on widely used Chinese and international datasets and on a text corpus from the software engineering domain. Experimental results show that the method overcomes the shortcomings of traditional topic models: topic coherence and diversity improve markedly, and the model performs well on polysemous words. In particular, WoBERT-ETM, which incorporates Chinese word segmentation, mines high-quality, fine-grained topic words and is very effective for large-scale text.

Key words: topic model, BERT model, word embedding, word vector visualization
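
For readers unfamiliar with the embedded topic model that BERT-ETM builds on, the following is a minimal sketch of two ideas named in the abstract. It follows the original ETM formulation, in which each topic's word distribution is a softmax over the inner products of the word embeddings with a topic embedding, and it computes topic diversity as the fraction of unique words among the top words of all topics. Whether this paper uses exactly that diversity definition is an assumption. The toy vocabulary, the 2-dimensional vectors, and all function names are hypothetical; in practice the word vectors would come from Word2Vec or a BERT-family model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topic_word_distribution(word_embeddings, topic_embedding):
    """ETM-style topic: softmax of (word embedding . topic embedding) over the vocabulary."""
    scores = [
        sum(r * a for r, a in zip(rho, topic_embedding))
        for rho in word_embeddings
    ]
    return softmax(scores)


def top_words(vocab, beta_k, n):
    """Return the n most probable words of one topic."""
    ranked = sorted(range(len(vocab)), key=lambda i: beta_k[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]


def topic_diversity(topics_top_words):
    """Fraction of unique words among the top words of all topics."""
    all_words = [w for topic in topics_top_words for w in topic]
    return len(set(all_words)) / len(all_words)


# Toy data (hypothetical): 5 words and 2 topics in a 2-dimensional embedding space.
vocab = ["model", "topic", "code", "test", "bug"]
rho = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.0, 0.8]]
alpha = [[2.0, 0.0], [0.0, 2.0]]

betas = [topic_word_distribution(rho, a) for a in alpha]
tops = [top_words(vocab, b, 2) for b in betas]
diversity = topic_diversity(tops)

print(tops)
print(diversity)
```

In this sketch every word has a single fixed vector, which is the limitation the abstract attributes to Word2Vec-based ETM. A contextual encoder such as BERT would instead produce a different vector for each occurrence of a polysemous word.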

WANG Yuhan, LIN Min, LI Yanling, ZHAO Jiapeng. Research on Embedded Text Topic Model Based on BERT[J]. Computer Engineering and Applications, 2023, 59(1): 169-179.
