Computer Engineering and Applications, 2025, Vol. 61, Issue (23): 181-194. DOI: 10.3778/j.issn.1002-8331.2409-0110

• Pattern Recognition and Artificial Intelligence •

Research on an Image-Text Multi-Modal Summarization Model Guided by Image Object Anchors

赵博文,马廷淮   

  1. School of Software, Nanjing University of Information Science and Technology, Nanjing 210044, China
  • Online: 2025-12-01  Published: 2025-12-01

MSM-AG: Multi-Modal Summarization Model with Image Object Anchor Guidance

ZHAO Bowen, MA Tinghuai   

  1. School of Software, Nanjing University of Information Science and Technology, Nanjing 210044, China
  • Online: 2025-12-01  Published: 2025-12-01

Abstract: This study focuses on the core semantic analysis of multi-modal input data, aiming to generate a text summary that fuses multi-modal information and to select the image that best matches the text summary as the image summary. Multi-modal summarization currently faces two major challenges: first, quantifying the semantic correlation between text and images is difficult, which hinders the mining of key semantics shared across modalities; second, the source-modality data are highly redundant, which makes it hard for a summary to focus precisely on key information. To address these challenges, a multi-modal image-text summarization model guided by image object anchors (multi-modal summarization model with image anchor guidance, MSM-AG) is proposed. The model builds an image anchor selection mechanism that identifies key object anchors in images and, based on them, divides text- and image-modality samples into positive and negative classes; contrastive learning is then used to sharpen the distinction between the two classes and to select the image summary that best matches the text summary. Extensive experiments on the HCSCL multi-modal news dataset show that MSM-AG outperforms existing multi-modal summarization models on multiple text-summarization evaluation metrics, effectively addressing the key problems in multi-modal summarization.

Keywords: multi-modal summarization, image object anchor, contrastive learning, semantic mining

Abstract: This study focuses on the core semantic analysis of multi-modal input data, aiming to generate text summaries that integrate multi-modal information and to select the images that best match those summaries as image summaries. The field currently faces two major challenges: (1) the difficulty of quantifying the semantic correlation between text and images, which hinders the mining of key semantics shared across modalities; (2) the high redundancy of source-modality data, which makes it hard for a summary to focus precisely on key information. To address these challenges, an innovative multi-modal summarization model guided by image object anchors, named MSM-AG (multi-modal summarization model with image anchor guidance), is proposed. The model constructs an image anchor selection mechanism that identifies key object anchors within images and, based on them, divides text- and image-modality samples into positive and negative classes. Contrastive learning is then employed to sharpen the distinction between these classes, allowing the model to select image summaries that closely correspond to the text summaries. Extensive experiments on the HCSCL multi-modal news dataset demonstrate that MSM-AG outperforms existing multi-modal summarization models on multiple text-summarization evaluation metrics, effectively addressing fundamental challenges in multi-modal summarization.
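
The following is a minimal, illustrative sketch (not the authors' implementation) of the anchor-guided contrastive idea described above: given an embedding of the generated text summary, embeddings of candidate images, and a positive/negative split derived from detected image object anchors, an InfoNCE-style loss pulls anchor-matched images toward the summary and pushes the remaining images away, and the best-matching image is chosen as the image summary. All function and variable names below are hypothetical, and the embeddings and anchor-based mask are assumed to be produced by upstream components.

import torch
import torch.nn.functional as F

def anchor_guided_contrastive_loss(text_emb, image_embs, positive_mask, temperature=0.07):
    # text_emb:      (d,)   embedding of the generated text summary
    # image_embs:    (n, d) embeddings of the n candidate images
    # positive_mask: (n,)   bool, True where an image's detected object anchors
    #                       overlap the key semantics of the summary (assumed given)
    text_emb = F.normalize(text_emb, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    sims = image_embs @ text_emb / temperature       # cosine similarities, shape (n,)
    log_prob = sims - torch.logsumexp(sims, dim=0)   # log-softmax over all candidates
    # Average negative log-likelihood over the positive (anchor-matched) images.
    return -log_prob[positive_mask].mean()

def select_image_summary(text_emb, image_embs):
    # Pick the candidate image whose embedding is most similar to the text summary.
    sims = F.normalize(image_embs, dim=-1) @ F.normalize(text_emb, dim=-1)
    return int(torch.argmax(sims))

if __name__ == "__main__":
    torch.manual_seed(0)
    text_emb = torch.randn(256)          # placeholder summary embedding
    image_embs = torch.randn(5, 256)     # placeholder embeddings of 5 candidate images
    positive_mask = torch.tensor([True, False, True, False, False])
    print(anchor_guided_contrastive_loss(text_emb, image_embs, positive_mask).item())
    print(select_image_summary(text_emb, image_embs))

In practice, the positive/negative mask would come from the anchor selection mechanism (for example, overlap between detected object labels and key entities in the summary), and the temperature would be tuned; this sketch only shows how such a split plugs into a standard contrastive objective.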

Key words: multi-modal summarization, image anchor, contrastive learning, semantic mining