Computer Engineering and Applications, 2024, Vol. 60, Issue (11): 95-104. DOI: 10.3778/j.issn.1002-8331.2302-0064

• Pattern Recognition and Artificial Intelligence •

Cross-Modal Retrieval with Improved Graph Convolution

ZHANG Hongtu, HUA Chunjian, JIANG Yi, YU Jianfeng, CHEN Ying

  1. School of Mechanical Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China
  2. Jiangsu Key Laboratory of Advanced Food Manufacturing Equipment & Technology, Wuxi, Jiangsu 214122, China
  3. School of Internet of Things Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online: 2024-06-01 Published: 2024-05-31

Abstract: Existing image-text cross-modal retrieval methods struggle to fully exploit intra-modal local consistency when measuring similarity in the common subspace. To address this, a cross-modal retrieval method incorporating improved graph convolution is proposed. To improve local consistency within each modality, a modality graph is constructed with each complete sample as a node, fully mining the interaction information between features. Because graph convolutional networks are typically limited to shallow learning, an initial residual connection and an identity mapping on the weights are added to every graph convolution layer to alleviate this limitation. So that the features of a central node are updated jointly from high-order and low-order neighbor information, the number of neighbor nodes is reduced and the number of graph convolution layers is increased. To learn common representations that are both highly locally consistent and semantically consistent, the weights of the common representation learning layer are shared, and the intra-modal semantic constraint and the inter-modal modality-invariance constraint in the common subspace are jointly optimized. Experimental results on two cross-modal datasets, Wikipedia and Pascal Sentence, show that the average mAP over the different retrieval tasks is 2.2% to 42.1% and 3.0% to 54.0% higher, respectively, than that of 11 existing methods.

Key words: graph convolution network, cross-modal retrieval, initial residual connection, identity mapping, adjacency matrix
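
The graph convolution ideas named in the abstract (samples as graph nodes, few neighbors per node, and many layers stabilized by an initial residual connection and identity mapping) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a cosine-similarity k-nearest-neighbor graph and the widely used GCNII-style update H(l+1) = ReLU(((1 - alpha) P H(l) + alpha H(0)) ((1 - beta_l) I + beta_l W(l))) with beta_l = log(lam / l + 1). The function names, the similarity measure, and the parameters `k`, `alpha`, and `lam` are illustrative assumptions that do not appear in the abstract.

```python
import numpy as np


def knn_adjacency(features, k):
    """Symmetric 0/1 adjacency linking each sample to its k most
    cosine-similar samples (assumed graph construction)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)              # never pick a node as its own neighbor
    nearest = np.argsort(-sim, axis=1)[:, :k]   # indices of the k most similar samples
    adj = np.zeros_like(sim)
    rows = np.repeat(np.arange(features.shape[0]), k)
    adj[rows, nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)               # make the graph undirected


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, with degrees taken after adding self-loops."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def deep_gcn_forward(h0, p, weights, alpha=0.1, lam=0.5):
    """Stack of graph convolution layers with an initial residual
    connection (alpha) and an identity mapping on the weights (beta)."""
    h = h0
    for layer, w in enumerate(weights, start=1):
        beta = np.log(lam / layer + 1.0)                  # decays with depth
        support = (1.0 - alpha) * (p @ h) + alpha * h0    # initial residual connection
        mapping = (1.0 - beta) * np.eye(w.shape[0]) + beta * w  # identity mapping
        h = np.maximum(0.0, support @ mapping)            # ReLU
    return h


# Toy usage: 8 samples with 4-dimensional features, 2 neighbors, 16 layers.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
p = normalize_adjacency(knn_adjacency(x, k=2))
layer_weights = [rng.normal(scale=0.1, size=(4, 4)) for _ in range(16)]
out = deep_gcn_forward(x, p, layer_weights)
```

With a small `k`, each layer mixes only a node's closest neighbors, while stacking more layers lets information from higher-order neighbors reach the central node. Because every layer re-injects `h0` and keeps the weight mapping close to the identity, the deep stack is less prone to collapsing all node features toward the same value.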