[1] ANTOL S, AGRAWAL A, LU J, et al. VQA: visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision, 2015: 2425-2433.
[2] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[C]//Proceedings of the 1st International Conference on Learning Representations. Scottsdale, USA: ICLR, 2013: 1-12.
[3] PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: 1532-1543.
[4] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of NAACL-HLT, 2019: 4171-4186.
[5] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[C]//Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA: ICLR, 2015: 1-14.
[6] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[7] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[8] MALINOWSKI M, ROHRBACH M, FRITZ M. Ask your neurons: a neural-based approach to answering questions about images[C]//Proceedings of the IEEE International Conference on Computer Vision, 2015: 1-9.
[9] GRAVES A. Long short-term memory[M]//Supervised Sequence Labelling with Recurrent Neural Networks. Berlin, Heidelberg: Springer, 2012: 37-45.
[10] REN M, KIROS R, ZEMEL R. Image question answering: a visual semantic embedding model and a new dataset[C]//Advances in Neural Information Processing Systems, 2015.
[11] CHEN K, WANG J, CHEN L C, et al. ABC-CNN: an attention based convolutional neural network for visual question answering[J]. arXiv:1511.05960, 2015.
[12] FUKUI A, PARK D H, YANG D, et al. Multimodal compact bilinear pooling for visual question answering and visual grounding[C]//Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016: 457-468.
[13] BEN-YOUNES H, CADENE R, CORD M, et al. MUTAN: multimodal Tucker fusion for visual question answering[C]//Proceedings of the IEEE International Conference on Computer Vision, 2017: 2612-2620.
[14] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 6077-6086.
[15] KIM J H, JUN J, ZHANG B T. Bilinear attention networks[C]//Advances in Neural Information Processing Systems, 2018: 1571-1581.
[16] YU Z, YU J, CUI Y, et al. Deep modular co-attention networks for visual question answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 6281-6290.
[17] ZHU Z, YU J, WANG Y, et al. Mucko: multi-layer cross-modal knowledge reasoning for fact-based visual question answering[C]//Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2021: 1097-1103.
[18] WANG Y C, ZHU M H, XU C, et al. Exploiting image caption and external knowledge as representation enhancement for VQA[J]. Journal of Tsinghua University (Science and Technology), 2022, 62(5): 900-907.
[19] TAN H, BANSAL M. LXMERT: learning cross-modality encoder representations from transformers[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019: 5100-5111.
[20] SHAW P, USZKOREIT J, VASWANI A. Self-attention with relative position representations[C]//Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2018: 464-468.
[21] DING Y, YU J, LIU B, et al. MuKEA: multimodal knowledge extraction and accumulation for knowledge-based visual question answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 5089-5098.
[22] VRANDEČIĆ D, KRÖTZSCH M. Wikidata: a free collaborative knowledgebase[J]. Communications of the ACM, 2014, 57(10): 78-85.
[23] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning, 2021: 8748-8763.
[24] REIMERS N, GUREVYCH I. Sentence-BERT: sentence embeddings using siamese BERT-networks[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019: 3982-3992.
[25] WANG P, WU Q, SHEN C, et al. FVQA: fact-based visual question answering[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(10): 2413-2427.
[26] WANG P, WU Q, SHEN C, et al. Explicit knowledge-based reasoning for visual question answering[C]//Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017: 1290-1296.
[27] MARINO K, RASTEGARI M, FARHADI A, et al. OK-VQA: a visual question answering benchmark requiring external knowledge[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 3195-3204.
[28] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the European Conference on Computer Vision, 2014: 740-755.
[29] YU J, ZHU Z, WANG Y, et al. Cross-modal knowledge reasoning for knowledge-based visual question answering[J]. Pattern Recognition, 2020, 108: 107563.
[30] LU J, BATRA D, PARIKH D, et al. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks[C]//Advances in Neural Information Processing Systems, 2019.
[31] MARINO K, CHEN X, PARIKH D, et al. KRISP: integrating implicit and symbolic knowledge for open-domain knowledge-based VQA[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 14111-14121.
[32] GARDÈRES F, ZIAEEFARD M, ABELOOS B, et al. ConceptBert: concept-aware representation for visual question answering[C]//Findings of the Association for Computational Linguistics: EMNLP 2020, 2020: 489-498.