Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (6): 180-185. DOI: 10.3778/j.issn.1002-8331.1905-0148

• Pattern Recognition and Artificial Intelligence •

Academic Abstract Clustering Method Based on LDA Model and Doc2vec

ZHANG Weiwei, HU Yaqi, ZHAI Guangyu, LIU Zhipeng

  1. School of Electronics and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
  2. School of Economics and Management, Lanzhou University of Technology, Lanzhou 730050, China
  • Online: 2020-03-15 Published: 2020-03-13

Abstract:

Short text clustering for specific tasks has become an important task in text data mining. Because academic abstracts are sparse, clustering them suffers from low accuracy and a semantic gap. Their narrow domain also produces a large number of inconsequential word overlaps, which makes it hard to distinguish topics and fine-grained clusters. To address this, this paper proposes a new clustering model, the Topic Paragraph Vector model (Doc2vec-LDA, Doc-LDA). By fusing the LDA (Latent Dirichlet Allocation) topic model with the paragraph vector model (Doc2vec), the model uses information from the entire corpus during training and also uses the local semantic space information of Paragraph Vector to complement the latent semantic information of LDA. The experiments use academic abstracts crawled from CNKI as the data set and apply the K-Means clustering algorithm to the abstract representations produced by each model to compare their clustering performance. The results show that clustering based on the Doc-LDA model outperforms clustering based on the LDA, Word2vec, and LDA+Word2vec models.

Key words: short text clustering, Latent Dirichlet Allocation (LDA) model, Doc2vec model, academic abstract
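
The abstract describes fusing a per-document LDA topic representation with a Doc2vec paragraph vector and then clustering the result with K-Means, but it does not say how the two are combined. The sketch below is only one plausible reading, under stated assumptions: fusion is taken to be concatenation of the two L2-normalized vectors, the toy vectors stand in for the outputs of trained LDA and Doc2vec models, and the names `fuse`, `kmeans`, `topic_vecs`, and `para_vecs` are hypothetical. It is not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so neither representation dominates."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(topic_vec, para_vec):
    """ASSUMPTION: fuse by concatenating the normalized LDA topic
    distribution and the normalized Doc2vec paragraph vector."""
    return l2_normalize(topic_vec) + l2_normalize(para_vec)


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100):
    """Plain K-Means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point farthest from its nearest already-chosen center.
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy inputs: 4 abstracts, 2 LDA topics, 3-dim paragraph vectors.
topic_vecs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
para_vecs = [[1.0, 0.2, 0.0], [0.9, 0.1, 0.1], [0.0, 0.3, 1.0], [0.1, 0.2, 0.9]]

fused = [fuse(t, p) for t, p in zip(topic_vecs, para_vecs)]
labels = kmeans(fused, k=2)
print(labels)  # the first two abstracts share a cluster, as do the last two
```

In practice the topic distributions and paragraph vectors would come from models trained on the abstract corpus, and a library K-Means would replace the hand-written loop. Normalizing before concatenation is a design choice made here so that the two representations contribute on a comparable scale.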