[1] 包希港, 周春来, 肖克晶, 等. 视觉问答研究综述[J]. 软件学报, 2021, 32(8): 2522-2544.
BAO X G, ZHOU C L, XIAO K J, et al. Survey on visual question answering[J]. Journal of Software, 2021, 32(8): 2522-2544.
[2] 王瑞平, 吴士泓, 张美航, 等. 视觉问答语言处理方法综述[J]. 计算机工程与应用, 2022, 58(17): 50-60.
WANG R P, WU S H, ZHANG M H, et al. Review of language processing methods for visual question answering[J]. Computer Engineering and Applications, 2022, 58(17): 50-60.
[3] JIANG H, MISRA I, ROHRBACH M, et al. In defense of grid features for visual question answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 10267-10276.
[4] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 6077-6086.
[5] YU Z, YU J, CUI Y, et al. Deep modular co-attention networks for visual question answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 6281-6290.
[6] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010.
[7] YU Z, CUI Y, YU J, et al. Deep multimodal neural architecture search[C]//Proceedings of the 28th ACM International Conference on Multimedia, 2020: 3743-3752.
[8] RAHMAN T, CHOU S H, SIGAL L, et al. An improved attention for visual question answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 1653-1662.
[9] YANG X, GAO C, ZHANG H, et al. Auto-parsing network for image captioning and visual question answering[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 2197-2207.
[10] ZHOU Y, REN T, ZHU C, et al. TRAR: routing the attention spans in transformer for visual question answering[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 2074-2084.
[11] LIU Y, WEI W, PENG D, et al. Depth-aware and semantic guided relational attention network for visual question answering[J]. IEEE Transactions on Multimedia, 2022, 25: 5344-5357.
[12] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019: 4171-4186.
[13] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[14] ZHANG X, SUN X, LUO Y, et al. RSTNet: captioning with adaptive attention on visual and non-visual words[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 15465-15474.
[15] HU H, GU J, ZHANG Z, et al. Relation networks for object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 3588-3597.
[16] LUO Y, JI J, SUN X, et al. Dual-level collaborative transformer for image captioning[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2021: 2286-2293.
[17] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of the 38th International Conference on Machine Learning, 2021: 8748-8763.
[18] GOYAL Y, KHOT T, SUMMERS-STAY D, et al. Making the V in VQA matter: elevating the role of image understanding in visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6904-6913.
[19] REN M, KIROS R, ZEMEL R. Exploring models and data for image question answering[C]//Proceedings of the 29th Conference on Neural Information Processing Systems, 2015: 2953-2961.
[20] HUDSON D A, MANNING C D. GQA: a new dataset for real-world visual reasoning and compositional question answering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 6700-6709.
[21] PENG L, YANG Y, ZHANG X, et al. Answer again: improving VQA with cascaded-answering model[J]. IEEE Transactions on Knowledge and Data Engineering, 2020, 34(4): 1644-1655.
[22] LIU Y, ZHANG X, ZHANG Q, et al. Dual self-attention with co-attention networks for visual question answering[J]. Pattern Recognition, 2021, 117: 107956.
[23] LIU Y, ZHANG X, HUANG F, et al. Visual question answering via combining inferential attention and semantic space mapping[J]. Knowledge-Based Systems, 2020, 207: 106339.
[24] LIU Y, ZHANG X, ZHAO Z, et al. ALSA: adversarial learning of supervised attentions for visual question answering[J]. IEEE Transactions on Cybernetics, 2020, 52(6): 4520-4533.
[25] PENG L, YANG Y, WANG Z, et al. MRA-Net: improving VQA via multi-modal relation attention network[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 44(1): 318-329.
[26] MAO A, YANG Z, LIN K, et al. Positional attention guided transformer-like architecture for visual question answering[J]. IEEE Transactions on Multimedia, 2022, 25: 6997-7009.
[27] MIAO Y, CHENG W, HE S, et al. Research on visual question answering based on GAT relational reasoning[J]. Neural Processing Letters, 2022, 54(2): 1435-1448.
[28] ZHANG W, YU J, ZHAO W, et al. DMRFNet: deep multimodal reasoning and fusion for visual question answering and explanation generation[J]. Information Fusion, 2021, 72: 70-79.
[29] QIN B, HU H, ZHUANG Y. Deep residual weight-sharing attention network with low-rank attention for visual question answering[J]. IEEE Transactions on Multimedia, 2022, 25: 4282-4295.