Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (17): 50-60. DOI: 10.3778/j.issn.1002-8331.2203-0243
WANG Ruiping (王瑞平), WU Shihong (吴士泓), ZHANG Meihang (张美航), WANG Xiaoping (王小平)
Published: 2022-09-01
Online: 2022-09-01
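Abstract: The language processing methods used in visual question answering (VQA) strongly affect the performance of VQA models. These methods originate in natural language processing (NLP), but as they developed they fell out of step with the state of the art in NLP, which has held back both question understanding and answer generation in VQA. Subjectively, the root cause is that researchers underestimate the importance of language processing; objectively, it is the scarcity of literature on the topic. To address this, the paper analyzes the value of language processing for VQA, surveys the language processing methods used in VQA together with the latest research results, and classifies those methods by type, giving researchers a basis for appreciating the importance of language processing. It then discusses how NLP techniques have advanced language processing in VQA and outlines future directions for these methods.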
WANG Ruiping, WU Shihong, ZHANG Meihang, WANG Xiaoping. Review of Language Processing Methods for Visual Question Answering[J]. Computer Engineering and Applications, 2022, 58(17): 50-60.