[1] LAKOFF G, JOHNSON M. Metaphors we live by[M]. Chicago: University of Chicago Press, 2003.
[2] FASS D. Met*: a method for discriminating metonymy and metaphor by computer[J]. Computational Linguistics, 1991, 17(1): 49-90.
[3] SHUTOVA E, KIELA D, MAILLARD J. Black holes and white rabbits: metaphor identification with visual features[C]//Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: ACL, 2016: 160-170.
[4] SU C, CHEN W J, FU Z, et al. Multimodal metaphor detection based on distinguishing concreteness[J]. Neurocomputing, 2021, 429: 166-173.
[5] KEHAT G, PUSTEJOVSKY J. Improving neural metaphor detection with visual datasets[C]//Proceedings of the 12th International Conference on Language Resources and Evaluation, 2020: 5928-5933.
[6] XU B, LI T T, ZHENG J Z, et al. MET-Meme: a multimodal meme dataset rich in metaphors[C]//Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2022: 2887-2899.
[7] CAI Y T, CAI H Y, WAN X J. Multi-modal sarcasm detection in Twitter with hierarchical fusion model[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 2506-2515.
[8] HEINTZ I, GABBARD R, SRIVASTAVA M, et al. Automatic extraction of linguistic metaphors with LDA topic modeling[C]//Proceedings of the 1st Workshop on Metaphor in NLP, 2013: 58-66.
[9] KÖPER M, SCHULTE IM WALDE S. Improving verb metaphor detection by propagating abstractness to words, phrases and individual senses[C]//Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and Their Applications. Stroudsburg: ACL, 2017: 24-30.
[10] STRZALKOWSKI T, BROADWELL G A, TAYLOR S, et al. Robust extraction of metaphor from novel data[C]//Proceedings of the 1st Workshop on Metaphor in NLP, 2013: 67-76.
[11] SHUTOVA E, SUN L, GUTIÉRREZ E D, et al. Multilingual metaphor processing: experiments with semi-supervised and unsupervised learning[J]. Computational Linguistics, 2017, 43(1): 71-123.
[12] MAO R, LIN C H, GUERIN F. Word embedding and WordNet based metaphor identification and interpretation[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2018: 1222-1231.
[13] PRAMANICK M, MITRA P. Unsupervised detection of metaphorical adjective-noun pairs[C]//Proceedings of the 2018 Workshop on Figurative Language Processing. Stroudsburg: ACL, 2018: 76-80.
[14] REI M, BULAT L, KIELA D, et al. Grasping the finer point: a supervised similarity network for metaphor detection[J]. arXiv:1709.00575, 2017.
[15] BIZZONI Y, GHANIMIFARD M. Bigrams and BiLSTMs: two neural networks for sequential metaphor detection[C]//Proceedings of the 2018 Workshop on Figurative Language Processing. Stroudsburg: ACL, 2018: 91-101.
[16] TANASESCU C, KESARWANI V, INKPEN D. Metaphor detection by deep learning and the place of poetic metaphor in digital humanities[C]//Proceedings of the 31st International Florida Artificial Intelligence Research Society Conference, 2018: 122-127.
[17] FORCEVILLE C, URIOS-APARISI E. Multimodal metaphor[M]. Berlin: Mouton de Gruyter, 2009.
[18] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems 30, 2017: 5998-6008.
[19] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv:1409.1556, 2014.
[20] KUMAR S, KULKARNI A, AKHTAR M S, et al. When did you become so smart, oh wise one?! sarcasm explanation in multi-modal multi-party dialogues[J]. arXiv:2203.06419, 2022.
[21] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[J]. arXiv:1810.04805, 2018.
[22] SUN L C, LIAN Z, LIU B, et al. Efficient multimodal transformer with dual-level feature restoration for robust multimodal sentiment analysis[J]. IEEE Transactions on Affective Computing, 2024, 15(1): 309-325.
[23] BA J L, KIROS J R, HINTON G E. Layer normalization[J]. arXiv:1607.06450, 2016.
[24] PASZKE A, GROSS S, MASSA F, et al. PyTorch: an imperative style, high-performance deep learning library[C]//Advances in Neural Information Processing Systems 32, 2019.
[25] KINGMA D P, BA J. Adam: a method for stochastic optimization[J]. arXiv:1412.6980, 2014.
[26] LEWIS M, LIU Y, GOYAL N, et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension[J]. arXiv:1910.13461, 2019.
[27] CLARK K, LUONG M T, LE Q V, et al. ELECTRA: pre-training text encoders as discriminators rather than generators[J]. arXiv:2003.10555, 2020.
[28] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
[29] CHEN X, ZHANG N Y, LI L, et al. Hybrid transformer with multi-level fusion for multimodal knowledge graph completion[C]//Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2022: 904-915.
[30] TSAI Y H, BAI S J, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 6558-6569.
[31] YANG B, SHAO B, WU L J, et al. Multimodal sentiment analysis with unidirectional modality translation[J]. Neurocomputing, 2022, 467: 130-137.
[32] XU N, ZENG Z X, MAO W J. Reasoning with multimodal sarcastic tweets via modeling cross-modality contrast and semantic association[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 3777-3786.
[33] PAN H L, LIN Z, FU P, et al. Modeling intra and inter-modality incongruity for multi-modal sarcasm detection[C]//Findings of the Association for Computational Linguistics: EMNLP 2020. Stroudsburg: ACL, 2020: 1383-1392.
[34] WANG X Y, SUN X W, YANG T, et al. Building a bridge: a method for image-text sarcasm detection without pretraining on image-text data[C]//Proceedings of the 1st International Workshop on Natural Language Processing Beyond Text. Stroudsburg: ACL, 2020: 19-29.
[35] RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. Journal of Machine Learning Research, 2020, 21(140): 1-67.
[36] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[J]. arXiv:2010.11929, 2020.
[37] TOUVRON H, CORD M, DOUZE M, et al. Training data-efficient image transformers & distillation through attention[C]//Proceedings of the 38th International Conference on Machine Learning, 2021: 10347-10357.
[38] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of the 38th International Conference on Machine Learning, 2021: 8748-8763.
[39] KIM W, SON B, KIM I. ViLT: vision-and-language transformer without convolution or region supervision[C]//Proceedings of the 38th International Conference on Machine Learning, 2021: 5583-5594.