Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (25): 132-136.

• Database, Signal and Information Processing •

Chinese multi-document summarization system based on topic information

WANG Hongling (1, 2), ZHANG Minghui (1, 2), ZHOU Guodong (1, 2)

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215002, China
  2. Jiangsu Provincial Key Laboratory of Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215002, China
  • Online: 2012-09-01 Published: 2012-08-30

Abstract: Multi-document summarization helps people obtain information automatically and quickly. Building a Chinese multi-document summarization system on a topic model is a new attempt; the topic model used here is Latent Dirichlet Allocation (LDA), a multi-level generative probabilistic model that can detect the topic distribution of documents. The method models the multi-document collection with LDA, computes the similarity between the topic probability distributions of each sentence and of the collection as that sentence's importance weight, and extracts summary sentences according to this weight. Experimental results show that the resulting summaries outperform those of traditional summarization methods under the proposed evaluation scheme.

Key words: automatic document summarization, Latent Dirichlet Allocation (LDA), topic model, multi-document
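The sentence-scoring and extraction step described in the abstract can be illustrated with a minimal sketch. It assumes that the per-sentence and collection-level topic distributions have already been inferred by some LDA implementation. The abstract does not name the similarity measure, so cosine similarity is used here purely as a stand-in. The function names `cosine_similarity` and `extract_summary`, the `max_sentences` parameter, and the toy values are all hypothetical and not taken from the paper.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length topic distributions.

    Assumption: the paper's actual similarity measure is not stated in the
    abstract; cosine similarity is only an illustrative choice.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same number of topics")
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def extract_summary(sentences, sentence_topics, collection_topics, max_sentences):
    """Pick the sentences whose topic distribution is closest to the collection's.

    sentence_topics[i] is the (pre-computed) topic distribution of sentences[i];
    collection_topics is the topic distribution of the whole document set.
    """
    scored = [
        (cosine_similarity(dist, collection_topics), index)
        for index, dist in enumerate(sentence_topics)
    ]
    # Highest weight first; ties are broken by earlier position in the input.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Return the selected sentences in their original order.
    chosen = sorted(index for _, index in scored[:max_sentences])
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    # Toy, hand-made distributions over three topics (not real LDA output).
    sentences = ["s0", "s1", "s2"]
    sentence_topics = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.5, 0.4, 0.1]]
    collection_topics = [0.6, 0.3, 0.1]
    print(extract_summary(sentences, sentence_topics, collection_topics, 2))
```

In this toy input, `s0` and `s2` are selected because their topic mix is closest to that of the collection, while `s1` is dominated by a topic that is marginal in the collection. A divergence-based measure such as Jensen-Shannon could be substituted for the cosine function without changing the extraction logic.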