Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (2): 170-178. DOI: 10.3778/j.issn.1002-8331.2309-0048

• Pattern Recognition and Artificial Intelligence •


PLSGA: Phase-Wise Long Text Summary Generation Approach

FANG Jin, LI Bao’an, YOU Xindong, LYU Xueqiang   

  1. School of Computer Science, Beijing Information Science and Technology University, Beijing 100101, China
  2. Beijing Key Laboratory of Cyber Culture and Digital Communication, Beijing Information Science and Technology University, Beijing 100101, China
  • Online:2025-01-15 Published:2025-01-15


Abstract: To address two problems that existing methods face on long text, the difficulty of handling redundant information and the inability to select the highest-quality summary, this paper proposes a phase-wise long text summary generation approach (PLSGA). First, the source text and the reference summary of each training sample are segmented into sentences, and Sentence-BERT is used to obtain semantic vectors and compare their similarity, extracting the key information of the text. An extraction model is then trained on the key and non-key information so as to preserve the semantic content of the original text as much as possible. Next, the extracted key information and the reference summary are fed as training samples into the backbone model BART to train the generation model. Finally, the generation model produces multiple candidate summaries, and a no-reference summary scoring model selects the one of the highest quality. Experiments on several Chinese long-text datasets show that PLSGA outperforms current mainstream methods as well as ChatGPT, exhibits domain advantages, and produces summaries of higher quality and readability.
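The first phase described above, labeling source sentences as key or non-key by similarity against the reference summary, can be sketched as follows. This is a minimal illustration, not the paper's implementation: a toy bag-of-words embedding and a hypothetical `threshold` hyperparameter stand in for Sentence-BERT encodings and the paper's actual selection criterion.

```python
import math
from collections import Counter

def embed(sentence):
    # Toy bag-of-words vector; in PLSGA this would be a Sentence-BERT encoding.
    return Counter(sentence.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def label_key_sentences(source_sents, summary_sents, threshold=0.3):
    """Mark each source sentence as key (1) if it is sufficiently similar
    to any reference-summary sentence, else non-key (0). These labels
    would then supervise the extraction model."""
    summary_vecs = [embed(s) for s in summary_sents]
    labels = []
    for sent in source_sents:
        vec = embed(sent)
        score = max((cosine(vec, sv) for sv in summary_vecs), default=0.0)
        labels.append(1 if score >= threshold else 0)
    return labels
```

For example, a source sentence sharing most of its words with a summary sentence is labeled key, while an unrelated one is not; the resulting binary labels provide the key/non-key training signal for the extraction model.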

Key words: text summarization, Sentence-BERT, key information, BART, no-reference summarization scoring model