Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (2): 48-64. DOI: 10.3778/j.issn.1002-8331.2206-0145
PAN Mengzhu (潘梦竹), LI Qianmu (李千目), QIU Tian (邱天)
Online: 2023-01-15
Published: 2023-01-15
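Abstract: Although deep learning has been widely applied in many fields thanks to its powerful nonlinear representation capability, the structural and semantic gaps between multi-source heterogeneous modal data seriously hinder the application of downstream deep learning models. Many researchers have proposed representation learning methods that explore the correlation and complementarity between modalities in order to improve the prediction and generalization performance of deep learning. Nevertheless, research on multimodal representation learning is still at an early stage, and many scientific problems remain to be solved. To date, the field lacks a unified understanding, and its overall architecture and evaluation metrics are not yet fully established. Based on the feature structures, semantic information, and representation capabilities of different modalities, this paper reviews and analyzes progress in deep multimodal representation learning from two perspectives, representation fusion and representation alignment, and systematically summarizes and classifies existing work. It also examines the basic structures, application scenarios, and key issues of representative frameworks and models, analyzes the theoretical foundations and latest developments of deep multimodal representation learning, and identifies the current challenges and future trends of the field, with the aim of further promoting its development and application.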
PAN Mengzhu, LI Qianmu, QIU Tian. Survey of Research on Deep Multimodal Representation Learning[J]. Computer Engineering and Applications, 2023, 59(2): 48-64.