
Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (23): 181-194.DOI: 10.3778/j.issn.1002-8331.2409-0110
• Pattern Recognition and Artificial Intelligence •
ZHAO Bowen, MA Tinghuai
Online:2025-12-01
Published:2025-12-01
赵博文,马廷淮
ZHAO Bowen, MA Tinghuai. MSM-AG: Multi-Modal Summarization Model with Image Object Anchor Guidance[J]. Computer Engineering and Applications, 2025, 61(23): 181-194.
赵博文, 马廷淮. 结合图片目标锚点引导的图文多模态摘要模型研究[J]. 计算机工程与应用, 2025, 61(23): 181-194.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2409-0110
| [1] | WU Xia, WANG Shaoqing, ZHANG Yao. Cross-View Contrastive Model for User Multi-Behavior Recommendation [J]. Computer Engineering and Applications, 2025, 61(6): 244-253. |
| [2] | DUAN Keke, ZHENG Junrong, YAN Ze. Unsupervised Tracking Combining Moving Object Discovery and Contrastive Learning [J]. Computer Engineering and Applications, 2025, 61(4): 141-149. |
| [3] | JI Yihao, REN Yizhi, YUAN Lifeng, LIU Rongke, PAN Gaoning. Event Type Induction Combined with Contrastive Learning and Iterative Optimization [J]. Computer Engineering and Applications, 2025, 61(3): 196-211. |
| [4] | YU Bengong, SHI Zhongyu. Deep Attention and Two-Stage Fusion of Image-Text Sentiment Contrastive Learning Method [J]. Computer Engineering and Applications, 2025, 61(3): 223-233. |
| [5] | SUN Shuaiqi, WEI Guiying, WU Sen. Sentiment Analysis Method Integrating Hypergraph Enhancement and Dual Contrastive Learning [J]. Computer Engineering and Applications, 2025, 61(22): 137-147. |
| [6] | LIU Jingxiang, WANG Feng, WEI Wei. Multi-View Contrastive Learning for Recommendation with Meta-Knowledge and SVD [J]. Computer Engineering and Applications, 2025, 61(22): 159-169. |
| [7] | BAI Tian, GAO Yuehong, XIE Zhengguang, LI Hongjun. Multimodal Cross-View Contrastive Memory-Augmented Network for Self-Supervised Skeleton-Based Action Recognition [J]. Computer Engineering and Applications, 2025, 61(21): 225-233. |
| [8] | DENG Haowen, WANG Hengsheng. Image Classification for Film Surface Defect Based on Contrastive Learning and Diffusion Model [J]. Computer Engineering and Applications, 2025, 61(21): 242-252. |
| [9] | HE Qixiang, GUO Hongyu, CHEN Qizhi, LIU Yulong. Unsupervised Semantic Segmentation Based on Object-Aware Semantic Cues [J]. Computer Engineering and Applications, 2025, 61(20): 218-227. |
| [10] | ZHANG Zheng, LIU Jinshuo, DENG Juan, WANG Lina. Multi-View Sarcasm Detection with Uni-Modal Supervised Contrastive Learning [J]. Computer Engineering and Applications, 2025, 61(19): 118-126. |
| [11] | REN Yandong, ZHANG Dong, LI Guanyu. Contrastive Learning with Integrated Attention and Structure Denoising in Knowledge-Aware Recommendation Algorithms [J]. Computer Engineering and Applications, 2025, 61(17): 232-240. |
| [12] | XIAO Cimei, JIANG Ailian, JI Wei, GAO Feng. Mask Reconstruction Fused with Contrastive Learning for Self-Supervised Medical Image Segmentation [J]. Computer Engineering and Applications, 2025, 61(15): 298-309. |
| [13] | ZHAO Hong, WANG He, LI Wengai. Research on Improving Text-to-Image Generation Method Through Contrastive Learning [J]. Computer Engineering and Applications, 2025, 61(14): 264-273. |
| [14] | HANG Tingting, GUO Ya, LI Desheng, FENG Jun. Survey on Research of Continual Relation Extraction Methods [J]. Computer Engineering and Applications, 2025, 61(14): 1-19. |
| [15] | WANG Aofei, SUN Fuzhen, SUN Xiujuan, ZHANG Wenxuan, WANG Shaoqing. Diffusion-Augmented Multi-View Intent Contrastive Learning Method for Sequential Recommendation [J]. Computer Engineering and Applications, 2025, 61(13): 338-348. |