Computer Engineering and Applications, 2023, Vol. 59, Issue (1): 169-179. DOI: 10.3778/j.issn.1002-8331.2106-0048

• Pattern Recognition and Artificial Intelligence •

Research on Embedded Text Topic Model Based on BERT

WANG Yuhan, LIN Min, LI Yanling, ZHAO Jiapeng

  1. College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China
  2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100089, China
  3. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100089, China
  • Online: 2023-01-01 Published: 2023-01-01

Abstract: Topic models can mine semantically rich topic words from massive text data and play an important role in text analysis tasks. The traditional LDA topic model represents text with a bag-of-words model, so it cannot model the semantic and sequential relationships between words, and it ignores stop words and low-frequency words. The embedded topic model (ETM) addresses these problems by representing words with Word2Vec vectors, but it usually assigns a polysemous word the same vector in every context and therefore cannot reflect context-dependent differences in meaning. To address this limitation, a BERT-based embedded topic model, BERT-ETM, is designed for topic mining. Its effectiveness is verified on general-purpose Chinese and international datasets and on a text corpus from the software engineering domain. The experimental results show that the method overcomes the shortcomings of traditional topic models: topic coherence and diversity improve markedly, and the model performs well on polysemous words. In particular, WoBERT-ETM, which incorporates Chinese word segmentation, mines high-quality, fine-grained topic words and is very effective on large-scale text.

Key words: topic model, BERT model, word embedding, word vector visualization
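
The abstract names ETM and topic diversity without giving formulas, so the following is only a minimal sketch under stated assumptions. It assumes the standard ETM formulation, in which the word distribution of a topic is a softmax over inner products between word embeddings and a topic embedding, and the commonly used definition of topic diversity as the fraction of unique words among the top-k words of all topics. How BERT-ETM obtains its word vectors is not described here, and the toy 2-D vectors, vocabulary, and function names below are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topic_word_distribution(word_vectors, topic_vector):
    """ETM-style distribution: softmax of <word embedding, topic embedding>."""
    scores = [
        sum(r * a for r, a in zip(vec, topic_vector))
        for vec in word_vectors
    ]
    return softmax(scores)


def top_words(vocab, beta, k):
    """Return the k words with the highest probability under beta."""
    ranked = sorted(range(len(vocab)), key=lambda i: beta[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


def topic_diversity(topics_top_words):
    """Fraction of unique words among the top words of all topics."""
    all_words = [w for topic in topics_top_words for w in topic]
    return len(set(all_words)) / len(all_words)


# Toy, hand-made embeddings (hypothetical values, not taken from the paper).
vocab = ["model", "topic", "test", "bug", "class", "object"]
word_vectors = [
    [1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
    [0.1, 0.9], [0.5, 0.5], [0.4, 0.6],
]
topic_vectors = [[2.0, 0.0], [0.0, 2.0]]

betas = [topic_word_distribution(word_vectors, t) for t in topic_vectors]
tops = [top_words(vocab, b, 2) for b in betas]
diversity = topic_diversity(tops)
print(tops, diversity)
```

In this sketch the word vectors are static, which is exactly the limitation the abstract attributes to Word2Vec-based ETM: a polysemous word receives a single row in `word_vectors` regardless of context.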