MSM-AG: Multi-Modal Summarization Model with Image Object Anchor Guidance

doi:10.3778/j.issn.1002-8331.2409-0110

Abstract

Abstract: This study analyzes the core semantics of multi-modal input data, aiming to generate a text summary that integrates multi-modal information and to select the image that best matches that text summary as the image summary. Multi-modal summarization currently faces two major challenges: (1) the semantic correlation between text and images is difficult to quantify, which hinders the mining of key semantics shared across modalities; (2) the source modality data are highly redundant, which makes it difficult for the summary to focus precisely on key information. To address these challenges, a multi-modal image-text summarization model guided by image object anchors is proposed, named MSM-AG (multi-modal summarization model with image anchor guidance). The model builds an image anchor selection mechanism that identifies the key object anchors in an image and, based on them, divides text and image modality samples into positive and negative classes. Contrastive learning is then used to sharpen the distinction between the two classes, so that the selected image summary closely matches the text summary. Extensive experiments on the HCSCL multi-modal news dataset show that MSM-AG outperforms existing multi-modal summarization models on multiple text summarization evaluation metrics, indicating that it effectively addresses these key problems in multi-modal summarization.

Key words: multi-modal summarization, image object anchor, contrastive learning, semantic mining
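
The contrastive step described in the abstract can be illustrated with a minimal sketch. This is not the MSM-AG implementation: the abstract does not give the loss function, so an InfoNCE-style loss over cosine similarities is assumed here, and the function names (`cosine`, `contrastive_loss`, `pick_image_summary`), the `temperature` value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(text_vec, positives, negatives, temperature=0.1):
    """Assumed InfoNCE-style loss (not taken from the paper).

    The loss is small when the positive image samples are more similar
    to the text embedding than the negative image samples are.
    """
    pos = [math.exp(cosine(text_vec, p) / temperature) for p in positives]
    neg = [math.exp(cosine(text_vec, n) / temperature) for n in negatives]
    denom = sum(pos) + sum(neg)
    return -sum(math.log(p / denom) for p in pos) / len(pos)


def pick_image_summary(text_vec, image_vecs):
    """Return the index of the image most similar to the text summary embedding."""
    scores = [cosine(text_vec, v) for v in image_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 2-D embeddings, purely illustrative.
text = [1.0, 0.0]
well_separated = contrastive_loss(text, [[0.9, 0.1]], [[0.0, 1.0]])
badly_separated = contrastive_loss(text, [[0.0, 1.0]], [[0.9, 0.1]])
best_image = pick_image_summary(text, [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]])
```

In this sketch, lowering the loss pulls positive image samples toward the text representation and pushes negative ones away, after which the most similar image is taken as the image summary. How MSM-AG actually derives the positive and negative sets from the selected anchors is not specified in the abstract.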

ZHAO Bowen, MA Tinghuai. MSM-AG: Multi-Modal Summarization Model with Image Object Anchor Guidance[J]. Computer Engineering and Applications, 2025, 61(23): 181-194.
