Survey of Research on Deep Multimodal Representation Learning

doi:10.3778/j.issn.1002-8331.2206-0145

Abstract

Deep learning is widely used across many fields because of its powerful nonlinear representation capability, yet the structural and semantic gap between multi-source heterogeneous modal data seriously hinders the application of downstream deep learning models. Many representation learning methods have been proposed to exploit the correlation and complementarity between modalities and thereby improve the prediction and generalization performance of deep learning. Nevertheless, research on multimodal representation learning is still at an early stage and many scientific problems remain open: the field lacks a unified understanding, and its architecture and evaluation metrics are not yet fully defined. Based on the feature structure, semantic information, and representation ability of different modalities, this paper reviews the progress of deep multimodal representation learning from two perspectives, representation fusion and representation alignment, and systematically summarizes and classifies existing work. It also examines the basic structure, application scenarios, and key issues of representative frameworks and models, analyzes the theoretical basis and latest developments of deep multimodal representation learning, and identifies current challenges and future directions, with the aim of further promoting the development and application of deep multimodal representation learning.

Key words: multimodal representation, deep learning, multimodal fusion, multimodal alignment

PAN Mengzhu, LI Qianmu, QIU Tian. Survey of Research on Deep Multimodal Representation Learning[J]. Computer Engineering and Applications, 2023, 59(2): 48-64.

