Computer Engineering and Applications, 2023, Vol. 59, Issue (6): 155-161. DOI: 10.3778/j.issn.1002-8331.2110-0115

• Pattern Recognition and Artificial Intelligence •

Cascading Attention Visual Question Answering Model Based on Graph Structure

ZHANG Haoyu, ZHANG De   

  1. School of Electrical and Information Engineering & Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
  • Online: 2023-03-15 Published: 2023-03-15




Abstract: Visual question answering (VQA) is a challenging problem that requires combining computer vision and natural language processing. Most existing methods take a two-stream approach: they first compute image features and question features separately, and then fuse them using various techniques and strategies. What is still lacking is a higher-level representation that directly captures both the semantics of the question and the spatial relations within the image. This paper proposes a cascaded attention learning model based on graph structure. The model combines a graph learning module (which learns a graph representation specific to the input image and question), a graph convolution layer, and a cascaded attention layer. Its purpose is to capture the spatial information of the image across different candidate box regions, as well as the higher-level relationships between those regions and the question. Experiments on the large-scale VQA v2.0 dataset show that, compared with mainstream algorithms, accuracy improves noticeably on yes/no, number (counting), and other question types. The overall accuracy reaches 68.34%, which verifies the effectiveness of the proposed model.

Key words: visual question answering, attention mechanism, graph convolutional network, feature fusion
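
The three stages named in the abstract (graph learning, graph convolution, cascaded attention) can be illustrated with a rough, dependency-free sketch. This is not the authors' implementation: the abstract gives no equations, so the element-wise question conditioning, the softmax-normalised adjacency, the two-step query refinement, and all function names (`learn_adjacency`, `graph_conv`, `cascaded_attention`) are assumptions chosen only to make the data flow concrete.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def learn_adjacency(regions, question):
    # Assumed graph learner: condition each region feature on the question
    # by element-wise product, then row-normalise pairwise similarities.
    joint = [[r * q for r, q in zip(region, question)] for region in regions]
    return [softmax([dot(a, b) for b in joint]) for a in joint]

def graph_conv(regions, adjacency, weight):
    # One graph convolution step: ReLU(A @ X @ W).
    dim = len(regions[0])
    out = []
    for row in adjacency:
        agg = [sum(a * region[k] for a, region in zip(row, regions))
               for k in range(dim)]
        proj = [sum(agg[k] * weight[k][j] for k in range(dim))
                for j in range(len(weight[0]))]
        out.append([max(0.0, p) for p in proj])
    return out

def attend(nodes, query):
    # Query-guided attention: softmax scores over nodes, then a weighted sum.
    weights = softmax([dot(n, query) for n in nodes])
    dim = len(nodes[0])
    pooled = [sum(w * n[k] for w, n in zip(weights, nodes)) for k in range(dim)]
    return weights, pooled

def cascaded_attention(nodes, question, steps=2):
    # Assumed cascade: each step refines the query with the attended summary.
    query = list(question)
    weights = []
    for _ in range(steps):
        weights, pooled = attend(nodes, query)
        query = [q + p for q, p in zip(query, pooled)]
    return weights, query

# Toy inputs: three candidate box regions with 3-d features (made-up values).
regions = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.5, 0.5, 1.0]]
question = [1.0, 0.2, 0.1]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

adjacency = learn_adjacency(regions, question)
nodes = graph_conv(regions, adjacency, identity)
weights, fused = cascaded_attention(nodes, question)
```

In a trained model the projection `weight` would be learned and the fused vector would feed an answer classifier; the identity matrix here only keeps the toy example easy to follow.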
