Review of Language Processing Methods for Visual Question Answering

doi:10.3778/j.issn.1002-8331.2203-0243

Abstract

Abstract: Language processing methods strongly affect the performance of visual question answering (VQA) models. These methods originate in natural language processing (NLP), but as they developed they fell out of step with the state of the art in NLP, which hinders the question understanding and answer generation that VQA requires. Subjectively, the root cause is that researchers underestimate the importance of language processing methods; objectively, it is the scarcity of relevant research literature. To address these problems, this paper analyzes the value of language processing for VQA, surveys the language processing methods used in VQA together with the latest research results in NLP, and summarizes the types of language processing methods and the relevant application scenarios of natural language processing, giving researchers a basis for recognizing the importance of language processing. Finally, it discusses how NLP techniques can advance language processing in VQA, looks ahead to future directions for language processing methods, and discusses the limitations of this paper.

Key words: visual question answering, natural language processing, language model, deep neural network, artificial intelligence

WANG Ruiping, WU Shihong, ZHANG Meihang, WANG Xiaoping. Review of Language Processing Methods for Visual Question Answering[J]. Computer Engineering and Applications, 2022, 58(17): 50-60.
