[1] JANGRA A, MUKHERJEE S, JATOWT A, et al. A survey on multi-modal summarization[J]. ACM Computing Surveys, 2023, 55(13): 1-36.
[2] 李群, 肖甫, 张子屹, 等. 基于空时变换网络的视频摘要生成[J]. 软件学报, 2022, 33(9): 3195-3209.
LI Q, XIAO F, ZHANG Z Y, et al. Video summarization based on spacial-temporal transform network[J]. Journal of Software, 2022, 33(9): 3195-3209.
[3] CHEN Z F, LU Z Y, RONG H, et al. Multi-modal anchor adaptation learning for multi-modal summarization[J]. Neurocomputing, 2024, 570: 127144.
[4] HE B, WANG J, QIU J L, et al. Align and attend: multimodal summarization with dual contrastive losses[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 14867-14878.
[5] BAYOUDH K, KNANI R, HAMDAOUI F, et al. A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets[J]. The Visual Computer, 2022, 38(8): 2939-2970.
[6] ZHANG L T, ZHANG X M, PAN J S. Hierarchical cross-modality semantic correlation learning model for multimodal summarization[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2022: 11676-11684.
[7] ZHANG Z K, MENG X J, WANG Y S, et al. UniMS: a unified framework for multimodal summarization with knowledge distillation[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2022: 11757-11764.
[8] RONG H, CHEN Z F, LU Z Y, et al. Multization: multi-modal summarization enhanced by multi-contextually relevant and irrelevant attention alignment[J]. ACM Transactions on Asian and Low-Resource Language Information Processing, 2024, 23(5): 1-29.
[9] CUI C H, LIANG X N, WU S Z, et al. Align vision-language semantics by multi-task learning for multi-modal summarization[J]. Neural Computing and Applications, 2024, 36(25): 15653-15666.
[10] FU X Y, WANG J, YANG Z L. MM-AVS: a full-scale dataset for multi-modal summarization[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: ACL, 2021: 5922-5926.
[11] 陈中峰, 陆振宇, 荣欢. 基于正反上下文语义对齐融合的多模态文本摘要模型[J]. 中文信息学报, 2024, 38(4): 108-119.
CHEN Z F, LU Z Y, RONG H. Multi-modal text summarization by positive and negative context alignment and fusion[J]. Journal of Chinese Information Processing, 2024, 38(4): 108-119.
[12] 刘泽宇, 马龙龙, 吴健, 等. 基于多模态神经网络的图像中文摘要生成方法[J]. 中文信息学报, 2017, 31(6): 162-171.
LIU Z Y, MA L L, WU J, et al. Chinese image captioning method based on multimodal neural network[J]. Journal of Chinese Information Processing, 2017, 31(6): 162-171.
[13] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of the International Conference on Machine Learning, 2021: 8748-8763.
[14] WANG Y F, ZHANG J, ZHANG B, et al. Research and implementation of Chinese couplet generation system with attention-based transformer mechanism[J]. IEEE Transactions on Computational Social Systems, 2022, 9(4): 1020-1028.
[15] HUANG Z L, WANG X G, HUANG L C, et al. CCNet: criss-cross attention for semantic segmentation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 603-612.
[16] 张北辰, 李亮, 查正军, 等. 基于跨模态对比学习的视觉问答主动学习方法[J]. 计算机学报, 2022, 45(8): 1730-1745.
ZHANG B C, LI L, ZHA Z J, et al. Contrastive cross-modal representation learning based active learning for visual question answer[J]. Chinese Journal of Computers, 2022, 45(8): 1730-1745.
[17] KIM T, KANG B, RHO M, et al. A multimodal deep learning method for Android malware detection using various features[J]. IEEE Transactions on Information Forensics and Security, 2019, 14(3): 773-788.
[18] HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and -specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1122-1131.
[19] YANG Q, WU G S, LI Y H, et al. AMNN: attention-based multimodal neural network model for hashtag recommendation[J]. IEEE Transactions on Computational Social Systems, 2020, 7(3): 768-779.
[20] SUMAN C, NAMAN A, SAHA S, et al. A multimodal author profiling system for tweets[J]. IEEE Transactions on Computational Social Systems, 2021, 8(6): 1407-1416.
[21] ZHU J N, LI H R, LIU T S, et al. MSMO: multimodal summarization with multimodal output[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2018: 4154-4164.
[22] KHOSLA P, TETERWAK P, WANG C, et al. Supervised contrastive learning[C]//Advances in Neural Information Processing Systems, 2020: 18661-18673.
[23] ZHANG W W, CHEN D J, XIAO Y, et al. Semi-supervised contrast learning based on multiscale attention and multitarget contrast learning for bearing fault diagnosis[J]. IEEE Transactions on Industrial Informatics, 2023, 19(10): 10056-10068.
[24] CHUANG C Y, ROBINSON J, LIN Y C, et al. Debiased contrastive learning[C]//Advances in Neural Information Processing Systems, 2020: 8765-8775.
[25] SHU X Y, YAN S Y, YANG X, et al. ASCL: adaptive self-supervised counterfactual learning for robust visual question answering[J]. Expert Systems with Applications, 2024, 248: 123125.
[26] LI X Y, ZHAO Z J, ZHANG Y P, et al. Spectrum sensing algorithm based on self-supervised contrast learning[J]. Electronics, 2023, 12(6): 1317.
[27] JAISWAL A, BABU A R, ZADEH M Z, et al. A survey on contrastive self-supervised learning[J]. arXiv:2011.00362, 2020.
[28] KONG T, SUN F C, LIU H P, et al. FoveaBox: beyound anchor-based object detection[J]. IEEE Transactions on Image Processing, 2020, 29: 7389-7398.
[29] 陈璐, 张儒清, 郭嘉丰, 等. 面向文本摘要的反事实纠偏方法[J]. 计算机学报, 2023, 46(11): 2400-2415.
CHEN L, ZHANG R Q, GUO J F, et al. Counterfactual debiasing for text summarization[J]. Chinese Journal of Computers, 2023, 46(11): 2400-2415.
[30] LIN C Y. ROUGE: a package for automatic evaluation of summaries[C]//Proceedings of the Conference on Text Summarization Branches Out, 2004: 74-81.
[31] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002: 311-318.
[32] BANERJEE S, LAVIE A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments[C]//Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and Summarization, 2005: 65-72.
[33] ANH D T, TRANG N T T. Abstractive text summarization using pointer-generator networks with pre-trained word embedding[C]//Proceedings of the 10th International Symposium on Information and Communication Technology. New York: ACM, 2019: 473-478.
[34] YAO K, ZHANG L, DU D, et al. Dual encoding for abstractive text summarization[J]. IEEE Transactions on Cybernetics, 2020, 50(3): 985-996.
[35] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 6000-6010.
[36] ZHOU Q Y, YANG N, WEI F R, et al. Selective encoding for abstractive sentence summarization[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2017: 1095-1104.
[37] KHULLAR A, ARORA U. MAST: multimodal abstractive summarization with trimodal hierarchical attention[C]//Proceedings of the 1st International Workshop on Natural Language Processing Beyond Text. Stroudsburg: ACL, 2020: 60-69.
[38] LI H, ZHU J, LIU T, et al. Multi-modal sentence summarization with modality attention and image filtering[C]//Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018: 4152-4158.
[39] LI H R, ZHU J N, ZHANG J J, et al. Multimodal sentence summarization via multimodal selective encoding[C]//Proceedings of the 28th International Conference on Computational Linguistics, 2020: 5655-5667.
[40] QIU J, ZHU J, XU M, et al. MHMS: multimodal hierarchical multimedia summarization[J]. arXiv:2204.03734, 2022.
[41] ZHU J N, ZHOU Y, ZHANG J J, et al. Multimodal summarization with guidance of multimodal reference[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2020: 9749-9756.
[42] LIANG X, CUI C, WU S, et al. Modeling paragraph-level vision-language semantic alignment for multi-modal summarization[J]. arXiv:2208.11303, 2022.
[43] LI H R, YUAN P, XU S, et al. Aspect-aware multimodal summarization for Chinese E-commerce products[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2020: 8188-8195.