Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (19): 118-126. DOI: 10.3778/j.issn.1002-8331.2407-0007

• Pattern Recognition and Artificial Intelligence •

Multi-View Sarcasm Detection with Uni-Modal Supervised Contrastive Learning

ZHANG Zheng, LIU Jinshuo, DENG Juan, WANG Lina   

  1. Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China
  • Online: 2025-10-01    Published: 2025-09-30

Abstract: The rapid growth of image and text data on social media has driven increasing interest in multimodal sarcasm detection. However, existing detection methods based on feature extraction and fusion have several shortcomings: first, most methods lack the underlying modality-alignment capability required for multimodal detection; second, the modality fusion process overlooks the dynamic relationships between modalities; and third, they fail to fully exploit modality complementarity. To address these issues, a detection model based on uni-modal supervised contrastive learning, multimodal fusion, and multi-view aggregation prediction is proposed. First, the CLIP (contrastive language-image pre-training) model is used as the encoder to improve the alignment of the underlying image and text encodings. Second, uni-modal supervised contrastive learning is incorporated so that uni-modal predictions guide the dynamic relationships between modalities. Next, a global-local cross-modal fusion method is designed, in which the semantic-level representation of each modality serves as a global multimodal context that interacts with local uni-modal features; stacking multiple cross-modal fusion layers improves the fusion effect while reducing the time and space costs of previous local-local cross-modal fusion methods. Finally, a multi-view aggregation prediction method is employed to fully exploit the complementarity of the image, text, and image-text views. Overall, the model effectively captures the cross-modal semantic inconsistencies in multimodal sarcasm data and outperforms the existing best method, DMSD-Cl, on the public MSD dataset.
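
To make the two key steps above concrete, the following is a minimal, self-contained PyTorch-style sketch of (a) a supervised contrastive loss applied to uni-modal (image or text) embeddings and (b) one global-local cross-modal fusion layer. It is an illustrative approximation of the ideas described in the abstract, not the authors' released code; the function and class names, the temperature value, and the use of nn.MultiheadAttention are assumptions introduced here for clarity.

import torch
import torch.nn as nn
import torch.nn.functional as F


def supervised_contrastive_loss(features, labels, temperature=0.07):
    """Supervised contrastive loss over a batch of uni-modal embeddings.

    Samples sharing the same sarcasm label are treated as positives
    (assumed formulation; temperature is an illustrative hyperparameter).
    features: (B, d) image or text embeddings; labels: (B,) class labels.
    """
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature               # (B, B) similarity logits
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(self_mask, float("-inf"))           # exclude self-pairs
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count
    return loss[pos_mask.any(dim=1)].mean()                   # skip anchors with no positive


class GlobalLocalFusionLayer(nn.Module):
    """One global-local cross-modal fusion layer (illustrative sketch):
    a pooled, semantic-level multimodal context queries the local token
    (or patch) features of a single modality via cross-attention."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_ctx, local_tokens):
        # global_ctx: (B, 1, d) global multimodal context used as the query
        # local_tokens: (B, L, d) local uni-modal features used as keys/values
        fused, _ = self.attn(global_ctx, local_tokens, local_tokens)
        return self.norm(global_ctx + fused)                  # residual update of the global view

Under this sketch, the image-view, text-view, and fused image-text-view logits would each come from a separate classification head and then be aggregated (for example, averaged or combined with learned weights) to form the multi-view prediction; the exact aggregation scheme is not specified in the abstract.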

Key words: sarcasm detection, multimodal, contrastive learning, cross-modal fusion