Cascading Attention Visual Question Answering Model Based on Graph Structure

doi:10.3778/j.issn.1002-8331.2110-0115

Abstract

Abstract: Visual question answering (VQA) is a challenging problem that requires combining concepts from computer vision and natural language processing. Most existing methods take a two-stream approach: they first compute image features and question features separately, and then fuse them using various techniques and strategies. What is still lacking is a higher-level representation that can directly capture the semantics of the question and the spatial relations in the image. This paper proposes a cascaded attention learning model based on graph structure. The model combines a graph learning module (which learns a question-specific graph representation of the input image), a graph convolution layer, and a cascaded attention layer. Its aim is to capture the spatial information of the image regions in different candidate boxes, as well as the higher-level relationships between those regions and the question. Experiments on the large-scale VQA v2.0 dataset show that, compared with mainstream algorithms, accuracy improves markedly on yes/no, number (counting), and other question types, and that overall accuracy reaches 68.34%, which verifies the effectiveness of the proposed model.

Key words: visual question answering, attention mechanism, graph convolutional network, feature fusion
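
The abstract names three components (a graph learning module, a graph convolution layer, and a cascaded attention layer) but gives no equations, so the following is only a minimal illustrative sketch of how such a pipeline might be wired together, not the authors' model. Everything in it is an assumption made for illustration: the function names, the element-wise question conditioning, the single graph-convolution step, the dot-product attention, the two-step cascade, and the toy dimensions and random inputs.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def learn_graph(regions, question):
    """Assumed graph learning step: modulate each region feature element-wise
    by the question vector, then turn pairwise dot products into a
    row-normalised adjacency matrix."""
    joint = [[rv * qv for rv, qv in zip(r, question)] for r in regions]
    scores = matmul(joint, [list(col) for col in zip(*joint)])  # joint @ joint^T
    return [softmax(row) for row in scores]


def graph_conv(adjacency, regions, weight):
    """One assumed graph-convolution step: aggregate neighbours, project, ReLU."""
    projected = matmul(matmul(adjacency, regions), weight)
    return [[max(0.0, v) for v in row] for row in projected]


def attend(nodes, query):
    """Dot-product attention of a query over node features."""
    weights = softmax([sum(n * q for n, q in zip(node, query)) for node in nodes])
    dim = len(nodes[0])
    pooled = [sum(w * node[k] for w, node in zip(weights, nodes)) for k in range(dim)]
    return pooled, weights


def cascaded_attention(nodes, question, steps=2):
    """Assumed cascade: each attention step refines the query for the next."""
    query = list(question)
    weights = []
    for _ in range(steps):
        pooled, weights = attend(nodes, query)
        query = [q + p for q, p in zip(query, pooled)]
    return query, weights


# Toy inputs: n candidate-box regions and a question vector, both of size d.
random.seed(0)
n, d = 4, 3
regions = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
question = [random.uniform(-1, 1) for _ in range(d)]
weight = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]

adjacency = learn_graph(regions, question)   # n x n question-conditioned graph
nodes = graph_conv(adjacency, regions, weight)  # n x d updated region features
fused, weights = cascaded_attention(nodes, question)  # d-dim fused vector
```

In a real system the region features would come from an object detector, the question vector from a text encoder, and the fused vector would feed an answer classifier; none of those parts are shown here.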

ZHANG Haoyu, ZHANG De. Cascading Attention Visual Question Answering Model Based on Graph Structure[J]. Computer Engineering and Applications, 2023, 59(6): 155-161.

张昊雨, 张德. 基于图结构的级联注意力视觉问答模型[J]. 计算机工程与应用, 2023, 59(6): 155-161.
