Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (7): 157-166. DOI: 10.3778/j.issn.1002-8331.2211-0447

• Pattern Recognition and Artificial Intelligence •

Image-Guided Augmentation Visual Question Answering Model Combined with Contrastive Learning

YANG You, YAO Lu   

  1. National Center for Applied Mathematics in Chongqing, Chongqing Normal University, Chongqing 401331, China
  2. School of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
  • Online: 2024-04-01    Published: 2024-04-01


Abstract: To address two problems in existing attention-based encoder-decoder visual question answering (VQA) models, an image-guided augmentation VQA model combined with contrastive learning (IGA-CL) is proposed. The first problem is that a single type of image feature carries incomplete visual information; the second is that existing models rely excessively on question guidance. To solve the first problem, a dual-feature visual decoder (DFVD) is proposed. Built on the Transformer language encoder, it extends the single image feature into two types, region and grid, and refines the visual information by constructing complementary spatial relations from the relative positions of the two feature types. To solve the second problem, a vision-guided language decoder (VGLD) is proposed, which matches the two decoded image features against the question features a second time. Within it, a parallel gated guided-attention (PGGA) module is designed to adaptively correct the proportion by which each image feature guides the question. To obtain more closely aligned mutual information, a contrastive learning loss function is introduced during training; it compares the similarity of the different modal features in the hidden space during model reasoning. The proposed model achieves overall accuracies of 73.82%, 72.49% and 57.44% on VQA 2.0, COCO-QA and GQA, respectively, exceeding the MCAN model by 2.92, 4.41 and 0.8 percentage points. Extensive ablation experiments and visualization analyses demonstrate the effectiveness of the proposed model. Experimental results show that it obtains more relevant language-vision information and generalizes better across different types of question samples.
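The abstract does not specify how the DFVD's complementary spatial relations are computed. A minimal PyTorch sketch of one common formulation follows: pairwise relative-geometry features (log-space offsets and size ratios) between two box sets, e.g. region proposals versus grid cells. The function name and the (cx, cy, w, h) box convention are assumptions for illustration, not the paper's definition.

```python
import torch

def relative_position_features(boxes_a, boxes_b, eps: float = 1e-3):
    """Pairwise relative-geometry features between two box sets
    (e.g. region proposals vs. grid cells): normalized center offsets
    and size ratios in log space, one 4-vector per (a, b) pair."""
    # boxes_*: (N, 4) tensors in (cx, cy, w, h) format
    cx_a, cy_a, w_a, h_a = boxes_a.unbind(-1)
    cx_b, cy_b, w_b, h_b = boxes_b.unbind(-1)
    # Center offsets, normalized by the source box size.
    dx = torch.log((cx_a[:, None] - cx_b[None, :]).abs().clamp(min=eps) / w_a[:, None])
    dy = torch.log((cy_a[:, None] - cy_b[None, :]).abs().clamp(min=eps) / h_a[:, None])
    # Width/height ratios in log space.
    dw = torch.log(w_a[:, None] / w_b[None, :])
    dh = torch.log(h_a[:, None] / h_b[None, :])
    return torch.stack([dx, dy, dw, dh], dim=-1)  # (Na, Nb, 4)
```

In relation-aware attention designs, such features are typically embedded and added as a bias to the attention logits between the two feature sets.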
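Likewise, the PGGA is only described at a high level: two guided attentions run in parallel, and a gate adaptively corrects how much each image feature guides the question. A hedged PyTorch sketch under that reading, with all module and parameter names hypothetical:

```python
import torch
import torch.nn as nn

class ParallelGatedGuidedAttention(nn.Module):
    """Sketch of one reading of PGGA: the question attends to region and
    grid features in parallel; a learned gate mixes the two outputs."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn_region = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_grid = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate predicts, per question token, how much to trust each branch.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, q, region_feats, grid_feats):
        # q: (B, Lq, D); region_feats: (B, Nr, D); grid_feats: (B, Ng, D)
        a_r, _ = self.attn_region(q, region_feats, region_feats)
        a_g, _ = self.attn_grid(q, grid_feats, grid_feats)
        g = self.gate(torch.cat([a_r, a_g], dim=-1))  # values in [0, 1]
        return g * a_r + (1.0 - g) * a_g  # adaptive mixture of the two guides
```

The sigmoid gate is what makes the guiding proportion adaptive: it can favor region evidence for object-centric questions and grid evidence for scene-level ones, per token.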
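Finally, the contrastive objective compares cross-modal features in the hidden space; a standard InfoNCE-style symmetric loss is one plausible instantiation (the paper's exact loss may differ). A sketch, assuming pooled per-sample image and question embeddings:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature: float = 0.07):
    """InfoNCE-style loss: matched image/question pairs in a batch are
    pulled together in the hidden space, mismatched pairs pushed apart."""
    img = F.normalize(img_emb, dim=-1)   # (B, D)
    txt = F.normalize(txt_emb, dim=-1)   # (B, D)
    logits = img @ txt.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric cross-entropy over image->text and text->image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```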

Key words: visual question answering, attention mechanism, relative position, gated mechanism, contrastive learning
