Multi-View Sarcasm Detection with Uni-Modal Supervised Contrastive Learning

doi:10.3778/j.issn.1002-8331.2407-0007

Abstract

Abstract: The rapid growth of image and text data on social media has drawn increasing attention to multimodal sarcasm detection. However, existing detection methods based on feature extraction and fusion have three shortcomings: first, most lack the low-level modality alignment that multimodal detection requires; second, the fusion process overlooks the dynamic relationships between modalities; third, they fail to fully exploit the complementarity of the modalities. To address these issues, a detection model based on uni-modal supervised contrastive learning, multimodal fusion, and multi-view aggregation prediction is proposed. First, the CLIP (contrastive language-image pre-training) model is used as the encoder to improve the alignment of the low-level image and text encodings. Second, uni-modal supervised contrastive learning is incorporated so that uni-modal predictions guide the dynamic relationships between modalities. Third, a global-local cross-modal fusion method is designed, in which the semantic-level representation of each modality serves as global multimodal context that interacts with local uni-modal features; stacking multiple cross-modal fusion layers improves the fusion while reducing the time and space costs of earlier local-local cross-modal fusion methods. Finally, a multi-view aggregation prediction method exploits the complementarity of the image, text, and image-text views. The model effectively captures cross-modal semantic inconsistency in multimodal sarcasm data and, on the public MSD dataset, outperforms DMSD-Cl, the best existing method.

Key words: sarcasm detection, multimodal, contrastive learning, cross-modal fusion
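
As an illustration of the uni-modal supervised contrastive objective mentioned in the abstract, the following is a minimal Python sketch of a commonly used supervised contrastive loss. The abstract does not give the paper's exact loss, so the formulation, the function names `l2_normalize` and `supervised_contrastive_loss`, the temperature value, and the toy features are all assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def supervised_contrastive_loss(features, labels, temperature=0.07):
    """Hypothetical sketch of a supervised contrastive loss for one modality.

    features: list of embedding vectors (e.g. text-only or image-only).
    labels:   list of class labels (e.g. 1 = sarcastic, 0 = not sarcastic).
    Samples sharing a label are treated as positives for each other.
    Returns the mean loss over anchors that have at least one positive.
    """
    z = [l2_normalize(f) for f in features]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity of anchor i to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability assigned to the positives.
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


if __name__ == "__main__":
    # Toy embeddings: the first two point one way, the last two another.
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(supervised_contrastive_loss(feats, [1, 1, 0, 0]))  # labels match clusters
    print(supervised_contrastive_loss(feats, [1, 0, 1, 0]))  # labels cut across clusters
```

In this toy setting the loss is smaller when same-label samples lie close together, which is the property a uni-modal contrastive term would encourage in each modality's representation space.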

ZHANG Zheng, LIU Jinshuo, DENG Juan, WANG Lina. Multi-View Sarcasm Detection with Uni-Modal Supervised Contrastive Learning[J]. Computer Engineering and Applications, 2025, 61(19): 118-126.
