Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (15): 200-206.DOI: 10.3778/j.issn.1002-8331.2009-0085


Chinese Document-Level Summary Model — DSum-SSE

HE Junmin, LU Menghua, MENG Kui   

  1. Shengli Geophysical Research Institute of China Petroleum and Chemical Corporation, Dongying, Shandong 257093, China
  2. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  • Online:2021-08-01 Published:2021-07-26

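Chinese Single-Document Summarization Model DSum-SSE

HE Junmin, LU Menghua, MENG Kui

  1. Geophysical Research Institute, Shengli Oilfield Branch, China Petroleum and Chemical Corporation, Dongying, Shandong 257093, China
  2. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China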

Abstract:

Text summarization filters the important information out of a text and presents it in a coherent form, helping readers obtain information quickly. In Chinese single-document summarization, supervised models remain immature because reliable datasets are lacking. This paper constructs CDESD (Chinese Document-level Extractive Summarization Dataset), a Chinese document-level summarization corpus of more than 200,000 articles, and proposes DSum-SSE (Document Summarization with SPA Sentence Embedding), a supervised document-level extractive summarization model. The model is built on a neural network framework. It first uses a sequence-to-sequence framework that combines Pointer and attention mechanisms to solve the sentence-level generative (abstractive) summarization problem, which yields a representation vector reflecting the core meaning of each sentence. An extreme Pointer mechanism is then introduced on this basis to complete the supervised document-level extractive summarization algorithm. Experiments show that DSum-SSE provides higher-quality summaries than TextRank, a popular unsupervised document-level extractive summarization algorithm. CDESD and DSum-SSE respectively supplement the corpus data and the models available for Chinese document-level summarization.

Key words: document-level summarization, extractive summary, sequence-to-sequence, attention mechanism, Pointer
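
For readers unfamiliar with the TextRank baseline named in the abstract, the following is a minimal sketch of graph-based extractive summarization, not the authors' implementation or experimental setup. It assumes character-level tokens (a simplification for Chinese text; a word segmenter would normally be used), the overlap similarity from the original TextRank formulation, and a damping factor of 0.85. The function names `sentence_similarity` and `textrank` are hypothetical.

```python
import math


def sentence_similarity(a, b):
    """Shared tokens divided by the sum of log sentence lengths."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # log(1) = 0 would make the denominator vanish
    shared = len(set(a) & set(b))
    return shared / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, top_k=2, damping=0.85, iterations=50):
    """Return the top_k highest-ranked sentences, kept in document order."""
    # Assumption: each character is a token (no word segmentation).
    tokens = [list(s) for s in sentences]
    n = len(tokens)

    # Weighted, undirected sentence graph with no self-loops.
    weights = [
        [sentence_similarity(tokens[i], tokens[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]

    # Weighted PageRank by fixed-count power iteration.
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores

    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]
```

Because the ranking relies only on surface overlap between sentences, it needs no labeled data. A supervised model such as DSum-SSE instead learns sentence representations from a corpus, which is the contrast the abstract draws.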

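Abstract (Chinese version):

To address the lack of reliable datasets and the immaturity of supervised summarization models in Chinese document summarization, a Chinese document-level summarization corpus of more than 200,000 articles (Chinese Document-level Extractive Summarization Dataset, CDESD) is constructed, and a supervised document-level extractive summarization model (Document Summarization with SPA Sentence Embedding, DSum-SSE) is proposed. The model is built on a neural network framework. It uses an end-to-end framework that combines Pointer and attention mechanisms to solve the sentence-level abstractive summarization problem, obtaining representation vectors that reflect the core meaning of each sentence, and then introduces an extreme Pointer mechanism on this basis to complete the document-level extractive summarization algorithm. Experiments show that, compared with the unsupervised single-document summarization algorithm TextRank, DSum-SSE can provide higher-quality summaries. CDESD and DSum-SSE respectively supplement the corpus data and the models available for Chinese document-level summarization.

Key words (Chinese version): document-level text summarization, extractive summarization, end-to-end framework, attention mechanism, Pointer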