Computer Engineering and Applications, 2011, Vol. 47, Issue (18): 130-134.

• Database, Signal and Information Processing •

Using topic and sub-event discovery to extract multi-document summaries

WANG Meng1, LI Chungui1, XU Chao2, HE Tingting3

  1. Department of Computer Engineering, Guangxi University of Technology, Liuzhou, Guangxi 545006, China
  2. Faculty of Software, Fujian Normal University, Fuzhou 350007, China
  3. Department of Computer Science, Huazhong Normal University, Wuhan 430079, China
  • Online: 2011-06-21 Published: 2011-06-21

Multi-document automatic summarization with topic and sub-event discovery

王 萌1,李春贵1,徐 超2,何婷婷3   

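  1. Department of Computer Engineering, Guangxi University of Technology, Liuzhou, Guangxi 545006
  2. Faculty of Software, Fujian Normal University, Fuzhou 350007
  3. Department of Computer Science, Huazhong Normal University, Wuhan 430079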

Abstract: A multi-document summarization method based on topic and sub-event discovery is proposed. Going beyond traditional frequency-based statistics, the method extracts eight basic word features from frequency, position information, event words, topic information, and related cues, then uses a logistic regression model to compute a score for each word. The summarizer scores each sentence in terms of its word scores and combines the sentence score with sentence redundancy to produce the summary. Three summarization systems (Coverage Baseline, Centroid-Based Summary, and Word Mining based Summary (WMS)) are compared on three measures: N-gram co-occurrence statistics, topic-word coverage, and high-frequency-word coverage. The experimental results show that the WMS system is more effective and feasible.

Key words: deep word mining, multi-document summarization, logistic regression model
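
The pipeline described in the abstract (word features, logistic word scores, sentence scores, redundancy-aware selection) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not list the eight features, so the four feature names, the weights, the bias, the averaging rule for sentence scores, the Jaccard redundancy measure, and the `penalty` parameter are all hypothetical stand-ins rather than the authors' actual choices.

```python
import math

# Hypothetical weights for illustration only; the paper fits its own
# logistic regression model over eight word features, which the abstract
# does not list individually.
WEIGHTS = {"frequency": 1.5, "in_lead": 0.8, "is_topic_word": 1.2, "is_event_word": 1.0}
BIAS = -2.0


def word_score(features):
    """Logistic model: sigmoid of the weighted feature sum, in (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def sentence_score(words, scores):
    """Average the word scores over a sentence (an assumed scoring rule)."""
    if not words:
        return 0.0
    return sum(scores.get(w, 0.0) for w in words) / len(words)


def overlap(a, b):
    """Jaccard overlap between two sentences' word sets, used as redundancy."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def summarize(sentences, scores, k=2, penalty=0.5):
    """Greedily pick k sentences, discounting overlap with those already chosen."""
    chosen = []
    candidates = list(range(len(sentences)))
    while candidates and len(chosen) < k:
        def adjusted(i):
            redundancy = max(
                (overlap(sentences[i], sentences[j]) for j in chosen), default=0.0
            )
            return sentence_score(sentences[i], scores) - penalty * redundancy

        best = max(candidates, key=adjusted)
        chosen.append(best)
        candidates.remove(best)
    return sorted(chosen)  # indices in original document order
```

In this sketch a sentence that nearly repeats an already selected one is pushed down the ranking, so a lower-scoring but novel sentence can be chosen instead; that is one simple way to combine score and redundancy, not necessarily the one used in the paper.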

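Abstract: A multi-document automatic summarization method based on topic and sub-event extraction is proposed. Going beyond traditional word-frequency statistics, the method considers, in addition to word frequency and position information, whether a word describes the topic or a sub-event of the document collection, and extracts eight basic features. A logistic regression model predicts the influence of these basic features on word weight and is used to compute the weight of each word. Sentences are scored by building a sentence vector space model, and the summary is produced by combining sentence scores with redundancy. Comparative experiments evaluate the quality of the summaries produced by three summarization systems, Coverage Baseline, Centroid-Based Summary, and Word Mining based Summary (WMS), on three measures: N-gram co-occurrence frequency, topic-word coverage, and high-frequency-word coverage. The results show that the WMS system performs better in several respects.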

Key words: deep word mining, multi-document automatic summarization, logistic regression model
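
The three evaluation measures named in the abstract are not defined there, so the following is only a plausible sketch of how they might be computed. It assumes a recall-style N-gram co-occurrence score against a reference summary and a simple set-based coverage ratio for topic words and high-frequency words; the function names and the `top` parameter are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_cooccurrence(candidate, reference, n=2):
    """Recall-style overlap: clipped matching n-grams / n-grams in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / total


def word_coverage(summary, target_words):
    """Fraction of target words (topic or high-frequency) present in the summary."""
    targets = set(target_words)
    if not targets:
        return 0.0
    return len(targets & set(summary)) / len(targets)


def high_frequency_words(documents, top=3):
    """Most common words across all documents (ties follow first-seen order)."""
    counts = Counter(word for doc in documents for word in doc)
    return [word for word, _ in counts.most_common(top)]
```

Under these assumptions, high-frequency-word coverage is `word_coverage(summary, high_frequency_words(documents))`, and topic-word coverage uses the same ratio with a separately supplied topic-word list.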