Computer Engineering and Applications ›› 2013, Vol. 49 ›› Issue (16): 142-145.

• Database, Data Mining, Machine Learning •

Multi-document summary using LDA and spectral clustering

FU Ling, ZHANG Hui   

  1. School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, Sichuan 621000, China
  • Online: 2013-08-15 Published: 2013-08-15

Abstract: Automatic summarization aims to compress lengthy documents into a few short paragraphs, presenting information to users comprehensively and concisely and improving the efficiency and accuracy with which they obtain it. The proposed method builds on Latent Dirichlet Allocation (LDA): Gibbs sampling is used to estimate the probability distribution of topics over words and of sentences over topics, and the LDA parameters are then combined with a spectral clustering algorithm to extract a multi-document summary. Sentence weights are integrated with a linear formula, and a 400-word multi-document summary is extracted. Summary quality is evaluated on the DUC2002 dataset with the ROUGE automatic summarization evaluation toolkit, and the results show that the method effectively improves summary quality.

Key words: Latent Dirichlet Allocation (LDA), Gibbs sampling, spectral clustering, multi-document summary
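
The abstract mentions three steps that a short sketch may make more concrete: combining sentence weights linearly, extracting a summary under a word budget, and scoring it with ROUGE. The abstract does not give the paper's formula, features, or coefficients, so everything below is an assumption for illustration only: the feature names `topic` and `cluster`, the coefficient values, the greedy selection rule, and the function names are all hypothetical, and `rouge_n_recall` is a simplified n-gram recall rather than the ROUGE toolkit itself (no stemming, stop-word handling, or multiple references).

```python
from collections import Counter


def linear_score(features, weights):
    """Weighted sum of one sentence's feature values (hypothetical features)."""
    return sum(weights[name] * value for name, value in features.items())


def select_summary(sentences, scores, max_words=400):
    """Greedily keep the highest-scoring sentences that fit the word budget."""
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in order:
        length = len(sentences[i].split())
        if used + length <= max_words:
            chosen.append(i)
            used += length
    # Restore the original sentence order so the summary reads naturally.
    return [sentences[i] for i in sorted(chosen)]


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    sentences = [
        "LDA models each document as a mixture of topics",
        "The weather was pleasant",
        "Spectral clustering groups similar sentences",
    ]
    # Hypothetical feature values; in practice they would be derived from
    # the LDA parameters and the clustering result.
    features = [
        {"topic": 0.9, "cluster": 0.5},
        {"topic": 0.1, "cluster": 0.2},
        {"topic": 0.7, "cluster": 0.6},
    ]
    weights = {"topic": 0.6, "cluster": 0.4}  # assumed coefficients
    scores = [linear_score(f, weights) for f in features]
    summary = " ".join(select_summary(sentences, scores, max_words=14))
    print(summary)
    print(rouge_n_recall(summary, "LDA models documents as mixtures of topics"))
```

Restoring the original sentence order after selection is a design choice of this sketch, not something the abstract states; a real system might instead order sentences by cluster or by source document.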