Computer Engineering and Applications, 2023, Vol. 59, Issue (2): 48-64. DOI: 10.3778/j.issn.1002-8331.2206-0145
• Research Hotspots and Reviews •
PAN Mengzhu, LI Qianmu, QIU Tian
Online: 2023-01-15
Published: 2023-01-15
PAN Mengzhu, LI Qianmu, QIU Tian. Survey of Research on Deep Multimodal Representation Learning[J]. Computer Engineering and Applications, 2023, 59(2): 48-64.
潘梦竹, 李千目, 邱天. 深度多模态表示学习的研究综述[J]. 计算机工程与应用, 2023, 59(2): 48-64.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2206-0145