Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (5): 95-102. DOI: 10.3778/j.issn.1002-8331.2209-0456

• Pattern Recognition and Artificial Intelligence •

Visual Question Answering Research on Joint Knowledge and Visual Information Reasoning

SU Zhenqiang, GOU Gang   

  1. State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  • Online: 2024-03-01   Published: 2024-03-01

Abstract: Visual question answering (VQA) is a multimodal task that requires fusing and reasoning over features from different modalities, and it has important practical applications. In traditional VQA, the answer can usually be inferred from the visual information in the image alone, but purely visual information cannot meet the diverse question-answering needs of real-world scenarios. Knowledge plays an important role in VQA and can effectively assist question answering: knowledge-based open VQA must associate external knowledge with the input in order to achieve cross-modal scene understanding. To better integrate visual information with the associated external knowledge, a bilinear structure for joint knowledge and visual information reasoning is proposed, together with a dual-guided attention module in which image features and question features jointly guide the knowledge representation. First, the model uses a pre-trained vision-language model to obtain feature representations of the question and image as well as visual reasoning information. Second, a similarity matrix is used to compute the image object regions that are semantically aligned with the question. Third, the question features and the aligned region features jointly guide the knowledge representation to obtain knowledge reasoning information. Finally, the visual reasoning information and the knowledge reasoning information are fused to produce the final answer. Experimental results on the open OK-VQA dataset show that the model improves accuracy by 1.97 and 4.82 percentage points over two baseline methods, respectively, which verifies its effectiveness.

Keywords: visual question answering, attention mechanism, feature fusion, multimodal alignment, external knowledge
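
The four steps named in the abstract (feature extraction, similarity-based alignment, dual-guided attention over knowledge, fusion) can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The abstract does not specify the scoring function, the pooling, or the fusion operator, so dot-product similarity, mean pooling, additive guidance scores, and element-wise sum fusion are all assumptions made here. The function names are hypothetical, and the toy vectors stand in for the outputs of a pre-trained vision-language model.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    """Sum of vectors scaled by their weights."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def mean_pool(vectors):
    n = len(vectors)
    return weighted_sum([1.0 / n] * n, vectors)

def align_regions(question_tokens, regions):
    """Build a token-by-region similarity matrix and pool the attended regions.

    Assumption: dot-product similarity, softmax over regions per token,
    then mean pooling over tokens.
    """
    sim = [[dot(t, r) for r in regions] for t in question_tokens]
    attended = [weighted_sum(softmax(row), regions) for row in sim]
    return sim, mean_pool(attended)

def dual_guided_attention(question_vec, aligned_vec, knowledge):
    """Weight knowledge entries using both guides at once.

    Assumption: the two guidance scores are simply added before the softmax.
    """
    scores = [dot(question_vec, k) + dot(aligned_vec, k) for k in knowledge]
    weights = softmax(scores)
    return weights, weighted_sum(weights, knowledge)

def fuse(visual_vec, knowledge_vec):
    """Placeholder fusion (element-wise sum); the paper's operator may differ."""
    return [v + k for v, k in zip(visual_vec, knowledge_vec)]

# Toy inputs standing in for pre-trained vision-language model outputs.
question_tokens = [[1.0, 0.0], [0.0, 1.0]]      # two question tokens
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three image object regions
knowledge = [[2.0, 0.0], [0.0, 0.5]]            # two external knowledge entries
visual_vec = [0.1, 0.2]                         # visual reasoning information

sim, aligned_vec = align_regions(question_tokens, regions)
question_vec = mean_pool(question_tokens)
weights, knowledge_vec = dual_guided_attention(question_vec, aligned_vec, knowledge)
answer_vec = fuse(visual_vec, knowledge_vec)
```

In a full model, `answer_vec` would feed an answer classifier, and the similarity, guidance, and fusion steps would use learned projections rather than raw dot products.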