Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (19): 184-192. DOI: 10.3778/j.issn.1002-8331.2102-0060

• Pattern Recognition and Artificial Intelligence •

Cross-Modal Representation Learning Under Multi-Negatives Contrastive Mechanism

DING Kaixuan, CHEN Yanxiang, ZHAO Pengcheng, ZHU Yupeng, SHENG Zhentao   

  1. School of Computer and Information, Hefei University of Technology, Hefei 230601, China
  • Online: 2022-10-01 Published: 2022-10-01




Abstract: To obtain more discriminative cross-modal representations, this paper proposes supervised contrastive cross-modal representation learning (SCCMRL), a cross-modal representation learning method based on a multi-negatives contrastive mechanism, and applies it to the visual and audio modalities. SCCMRL extracts visual and audio features with a visual encoder and an audio encoder, and uses a supervised contrastive loss to contrast each sample with its multiple negatives. As a result, audio-visual features of the same category are pulled closer together, while those of different categories are pushed further apart. The method also introduces a center loss and a label loss to further ensure modality consistency and semantic discrimination among the cross-modal representations. To verify the effectiveness of SCCMRL, a cross-modal retrieval system is built on it, and cross-modal retrieval experiments are conducted on the Sub_URMP and XmediaNet datasets. The experimental results show that SCCMRL achieves higher mAP values than commonly used cross-modal retrieval methods, and they confirm the feasibility of cross-modal representation learning under the multi-negatives contrastive mechanism.
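
The multi-negatives idea in the abstract can be illustrated with a minimal sketch of a supervised contrastive loss. This is not the authors' implementation: the abstract does not give the exact formulation, so the code assumes the commonly used form in which every other sample with the same label is a positive and every other sample in the batch appears in the denominator. The function names, the temperature value, and the toy two-dimensional embeddings are all assumptions made for illustration, and visual and audio embeddings are simply pooled into one batch.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean supervised contrastive loss over anchors that have at least one positive.

    Assumed form (not taken from the paper): for anchor i, positives are all
    other samples sharing its label, and all other samples (positives and
    negatives alike) contribute to the softmax denominator.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity between the anchor and every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature for j in others
        }
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Hypothetical toy batch: two visual and two audio embeddings, pooled together.
visual = [[1.0, 0.1], [0.1, 1.0]]
audio = [[0.9, 0.2], [0.2, 0.9]]

# Labels that agree with the geometry (nearby vectors share a class).
aligned = supervised_contrastive_loss(visual + audio, [0, 1, 0, 1])
# Labels that contradict the geometry (nearby vectors are in different classes).
shuffled = supervised_contrastive_loss(visual + audio, [0, 1, 1, 0])
print(aligned < shuffled)
```

In this toy batch the loss is small when same-category audio and visual embeddings already lie close together and large when they do not, which is the behaviour the abstract attributes to the supervised contrastive term. The center loss and label loss mentioned in the abstract are not sketched here.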

Key words: cross-modal representation learning, multimodal feature fusion, multi-negatives contrastive mechanism, supervised contrastive loss, cross-modal retrieval
