Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (16): 115-124. DOI: 10.3778/j.issn.1002-8331.2205-0056

• Pattern Recognition and Artificial Intelligence •

Cross-Modal Information Interaction Reasoning Network for Image and Text Retrieval

WEI Yuqi, LI Ning   

  1. College of Science, Northeastern University, Shenyang 110819, China
  • Online: 2023-08-15  Published: 2023-08-15

Abstract: To address the inconsistent semantic feature complexity between the image and text modalities in cross-modal retrieval tasks, an image-text matching method combining local fine-grained alignment with global feature reasoning is proposed. First, image and text features are fed into an adaptive cross-attention network, which places gating units inside the cross-attention mechanism and uses relevant semantic features from the text (image) modality to adaptively guide cross-attention over the image (text) modality. The gates highlight key local alignment features while promptly and efficiently filtering out redundant interaction information, yielding more accurate fine-grained alignment. Then, the features output by the adaptive cross-attention network, which carry text (image) guidance information, are used in a global reasoning network to progressively synthesize global alignment features for the image (text). This network not only exploits long- and short-term memory relationships among these features to flexibly fuse the fine-grained aligned features into global features, but also deepens its understanding of the overall latent semantics from cross-modal interaction information as it iteratively updates the current global features. Finally, the whole model is trained with a cross-entropy loss. A series of experiments is conducted on the public MS COCO and Flickr30K datasets, with results compared using the Recall@K metric, showing that the proposed model outperforms current state-of-the-art models.
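To make the adaptive cross-attention concrete, below is a minimal PyTorch sketch of cross-attention with a gating unit: one modality's features attend over the other's, and a sigmoid gate decides how much attended context to pass through, filtering redundant interaction information. This is an illustrative assumption, not the paper's exact design; the module name GatedCrossAttention, single-head dot-product attention, and the gate's concatenation form are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCrossAttention(nn.Module):
    """Sketch: queries (e.g. image regions) attend to contexts (e.g. words);
    a sigmoid gate suppresses redundant cross-modal interaction."""
    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5
        self.gate = nn.Linear(2 * dim, dim)  # gating unit (assumed form)

    def forward(self, queries, contexts):
        # queries:  (n_q, dim), e.g. image region features
        # contexts: (n_c, dim), e.g. word features of the paired sentence
        attn = F.softmax(queries @ contexts.t() * self.scale, dim=-1)
        attended = attn @ contexts                    # text-guided context per region
        g = torch.sigmoid(self.gate(torch.cat([queries, attended], dim=-1)))
        return g * attended + (1.0 - g) * queries     # gate filters redundant context
```

For instance, GatedCrossAttention(256)(torch.randn(36, 256), torch.randn(12, 256)) returns 36 gated, text-guided region features of dimension 256.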

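The global reasoning step can likewise be sketched as an iterative, gated update of a global feature synthesized from the locally aligned features. Here a GRU cell stands in for the long/short-term memory relationship mentioned in the abstract; the step count, mean initialization, and attention form are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalReasoning(nn.Module):
    """Sketch: iteratively refine a global alignment feature from the
    locally aligned features produced by the cross-attention stage."""
    def __init__(self, dim, steps=3):
        super().__init__()
        self.steps = steps
        self.cell = nn.GRUCell(dim, dim)  # gated memory update (LSTM-style stand-in)

    def forward(self, local_feats):                  # local_feats: (n, dim)
        g = local_feats.mean(dim=0)                  # initial global feature
        for _ in range(self.steps):
            w = F.softmax(local_feats @ g, dim=0)    # relevance of each local feature
            ctx = (w.unsqueeze(-1) * local_feats).sum(dim=0)  # weighted fusion
            g = self.cell(ctx.unsqueeze(0), g.unsqueeze(0)).squeeze(0)
        return g                                     # global feature for similarity scoring
```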
Key words: cross-modal image-text retrieval, cross-attention, relational reasoning, multimodal interaction
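The abstract's final steps, cross-entropy training and Recall@K evaluation, can be sketched as follows. Cross-entropy over image-text similarities is commonly implemented as bidirectional classification of matched pairs within a batch; this is one plausible reading, since the paper's exact formulation is not given here.

```python
import torch
import torch.nn.functional as F

def matching_loss(sim):
    # sim: (B, B) batch similarity matrix; matched pairs lie on the diagonal
    labels = torch.arange(sim.size(0), device=sim.device)
    # cross-entropy in both retrieval directions: image->text and text->image
    return F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)
```

Recall@K is the fraction of queries whose ground-truth match appears among the top K retrieved candidates. A straightforward NumPy sketch, assuming one ground-truth candidate per query indexed on the diagonal:

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    # sim[i, j]: similarity of query i to candidate j; ground truth at j == i
    order = np.argsort(-sim, axis=1)  # candidates ranked best-first per query
    gt_rank = np.array([np.where(order[i] == i)[0][0] for i in range(sim.shape[0])])
    return {f"R@{k}": float((gt_rank < k).mean()) for k in ks}
```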