Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (5): 95-102. DOI: 10.3778/j.issn.1002-8331.2209-0456

• Pattern Recognition and Artificial Intelligence •

Visual Question Answering Research on Joint Knowledge and Visual Information Reasoning

SU Zhenqiang, GOU Gang   

  1. State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  • Online: 2024-03-01    Published: 2024-03-01

Abstract: As a multimodal task, visual question answering (VQA) requires fusing and reasoning over features from different modalities and has important application value. In traditional VQA, the answer can be inferred from the visual information of the image alone; however, purely visual information cannot satisfy the diverse question-answering needs of real-world scenarios. Knowledge plays an important role in VQA and can effectively assist question answering. Knowledge-based open-domain VQA must associate external knowledge in order to achieve cross-modal scene understanding. To better integrate visual information with the associated external knowledge, a bilinear structure for joint knowledge and visual information reasoning is proposed, together with an attention module in which image features and question features jointly guide the knowledge representation. Firstly, the model uses a pre-trained vision-language model to obtain feature representations of the question and image as well as visual reasoning information. Secondly, a similarity matrix is used to compute the image object regions semantically aligned with the question; the question features, combined with the aligned region features, then jointly guide the knowledge representation to produce knowledge reasoning information. Finally, the visual reasoning information and the knowledge reasoning information are fused to obtain the final answer. Experimental results on the open OK-VQA dataset show that the accuracy of the model exceeds that of two baseline methods by 1.97 and 4.82 percentage points, respectively, verifying the effectiveness of the model.
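The three steps sketched in the abstract (question-aligned region pooling via a similarity matrix, dual-guided attention over external knowledge, and fusion of the two reasoning streams) can be illustrated with a short PyTorch sketch. Everything below is a hypothetical rendering for illustration: the names align_regions, DualGuidedAttention, and AnswerFusion, the tensor shapes, and the additive guidance and concatenation-based fusion are assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

def align_regions(q_tokens, regions):
    """Pool image region features under the semantics of the question.
    q_tokens: (B, T, D) question token features; regions: (B, R, D).
    Returns (B, D) question-aligned region features."""
    sim = torch.einsum('btd,brd->btr', q_tokens, regions)   # similarity matrix
    weights = F.softmax(sim.max(dim=1).values, dim=-1)      # (B, R) region weights
    return torch.einsum('br,brd->bd', weights, regions)

class DualGuidedAttention(nn.Module):
    """Question features and aligned region features jointly guide
    attention over retrieved external-knowledge embeddings."""
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # question guidance
        self.r_proj = nn.Linear(dim, dim)  # aligned-region guidance
        self.k_proj = nn.Linear(dim, dim)  # knowledge keys
        self.scale = dim ** -0.5

    def forward(self, q_feat, r_feat, knowledge):
        # q_feat, r_feat: (B, D); knowledge: (B, N, D) retrieved knowledge embeddings
        guide = self.q_proj(q_feat) + self.r_proj(r_feat)
        scores = torch.einsum('bd,bnd->bn', guide, self.k_proj(knowledge)) * self.scale
        attn = F.softmax(scores, dim=-1)
        # knowledge reasoning information: attention-weighted knowledge
        return torch.einsum('bn,bnd->bd', attn, knowledge)

class AnswerFusion(nn.Module):
    """Fuse visual reasoning info and knowledge reasoning info into answer logits."""
    def __init__(self, dim, num_answers):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, visual_info, knowledge_info):
        fused = self.fuse(torch.cat([visual_info, knowledge_info], dim=-1))
        return self.classifier(fused)

In use, q_tokens and regions would come from the pre-trained vision-language model, visual_info from its visual reasoning branch, and knowledge from embeddings of facts retrieved for the question; the paper's bilinear interaction may differ from the additive guidance used in this sketch.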

Key words: visual question answering, attention mechanism, feature fusion, multimodal alignment, external knowledge
