Academic Abstract Clustering Method Based on LDA Model and Doc2vec

ZHANG Weiwei, HU Yaqi, ZHAI Guangyu, LIU Zhipeng

1. School of Electronics and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
2. School of Economics and Management, Lanzhou University of Technology, Lanzhou 730050, China

Computer Engineering and Applications, 2020, Vol. 56, Issue (6): 180-185. DOI: 10.3778/j.issn.1002-8331.1905-0148

Online: 2020-03-15 Published: 2020-03-13

Abstract:

Short text clustering for specific tasks has become an important problem in text data mining. Because academic abstracts are sparse, clustering them suffers from low accuracy and a semantic gap, and their narrow domain produces a large amount of inconsequential word overlap, which makes it hard to distinguish topics and fine-grained clusters. To address this, this paper proposes a new clustering model, the Topic Paragraph Vector model (Doc2vec-LDA, Doc-LDA). By merging the LDA (Latent Dirichlet Allocation) topic model with the paragraph vector model (Doc2vec), the model both exploits the information of the entire corpus during training and uses the local semantic space information of Paragraph Vector to refine the latent semantic information of LDA. The experiments use academic abstracts crawled from CNKI as the data set and apply the K-Means clustering algorithm to the abstract representations produced by each model to compare their effectiveness. The results show that clustering based on the Doc-LDA model outperforms the LDA, Word2vec, and LDA+Word2vec models.

Key words: short text clustering, Latent Dirichlet Allocation (LDA) model, Doc2vec model, academic abstract
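The abstract describes the pipeline only at a high level: combine a corpus-level topic representation with a paragraph vector, then cluster with K-Means. The sketch below illustrates that general idea under stated assumptions and is not the authors' implementation. It assumes the two representations are fused by concatenating L2-normalized vectors, which the abstract does not specify. The topic distributions and paragraph vectors are hypothetical toy values standing in for the output of an LDA model and a Doc2vec model, and the helper names (`fuse`, `kmeans`) are illustrative. K-Means is initialized with a deterministic farthest-first rule so the toy result is reproducible.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(topic_vec, doc_vec):
    """Concatenate a topic distribution with a paragraph vector.

    Assumption: plain concatenation of normalized vectors. The abstract
    does not say how Doc-LDA actually combines the two representations.
    """
    return l2_normalize(topic_vec) + l2_normalize(doc_vec)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Lloyd's K-Means; returns one cluster label per point."""
    # Deterministic farthest-first initialization: start from the first
    # point, then add the point farthest from the centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy inputs: four abstracts, two topics, 3-dimensional
# paragraph vectors. Real values would come from trained models.
topic_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
doc_vectors = [
    [1.0, 0.2, 0.0],
    [0.9, 0.1, 0.1],
    [-0.8, 0.1, 0.9],
    [-0.9, 0.0, 1.0],
]

fused = [fuse(t, d) for t, d in zip(topic_vectors, doc_vectors)]
labels = kmeans(fused, k=2)
print(labels)  # [0, 0, 1, 1]
```

Normalizing each part before concatenation keeps the topic distribution and the paragraph vector on a comparable scale, so neither dominates the Euclidean distance used by K-Means. Other fusion choices, such as weighting one part more heavily, are equally compatible with the abstract's description.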

Cite this article:

ZHANG Weiwei, HU Yaqi, ZHAI Guangyu, LIU Zhipeng. Academic Abstract Clustering Method Based on LDA Model and Doc2vec[J]. Computer Engineering and Applications, 2020, 56(6): 180-185.

