Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (2): 48-64.DOI: 10.3778/j.issn.1002-8331.2206-0145

• Research Hotspots and Reviews • Previous Articles     Next Articles

Survey of Research on Deep Multimodal Representation Learning

PAN Mengzhu, LI Qianmu, QIU Tian   

  1. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
  • Online:2023-01-15 Published:2023-01-15



  1. 南京理工大学 计算机科学与工程学院,南京 210094

Abstract: Although deep learning has been widely used in many fields because of its powerful nonlinear representation capabilities, the structural and semantic gap between multi-source heterogeneous modal data seriously hinders the application of subsequent deep learning models. Many scholars have proposed a large number of representation learning methods to explore the correlation and complementarity between different modalities, and improve the performance of deep learning prediction and generalization. However, the research on multimodal representation learning is still in its infancy, and there are still many scientific problems to be solved. So far, multimodal representation learning still lacks a unified cognition, and the architecture and evaluation metrics of multimodal representation learning research are not fully clear. According to the feature structure, semantic information and representation ability of different modalities, this paper studies and analyzes the progress of deep multimodal representation learning from the perspectives of representation fusion and representation alignment. And the existing research work is systematically summarized and scientifically classified. At the same time, this paper analyzes the basic structure, application scenarios and key issues of representative frameworks and models, analyzes the theoretical basis and latest development of deep multimodal representation learning, and points out the current challenges and future development of multimodal representation learning research, to further promote the development and application of deep multimodal representation learning.

Key words: multimodal representation, deep learning, multimodal fusion, multimodal alignment

摘要: 尽管深度学习因为强大的非线性表示能力已广泛应用于许多领域,多源异构模态数据间结构和语义上的鸿沟严重阻碍了后续深度学习模型的应用。虽然已经有许多学者提出了大量的表示学习方法以探索不同模态间的相关性和互补性,并提高深度学习预测和泛化性能。然而,多模态表示学习研究还处于初级阶段,依然存在许多科学问题尚需解决。迄今为止,多模态表示学习仍缺乏统一的认知,多模态表示学习研究的体系结构和评价指标尚不完全明确。根据不同模态的特征结构、语义信息和表示能力,从表示融合和表示对齐两个角度研究和分析了深度多模态表示学习的进展,并对现有研究工作进行了系统的总结和科学的分类。同时,解析了代表性框架和模型的基本结构、应用场景和关键问题,分析了深度多模态表示学习的理论基础和最新发展,并且指出了多模态表示学习研究当前面临的挑战和今后的发展趋势,以进一步推动深度多模态表示学习的发展和应用。

关键词: 多模态表示, 深度学习, 多模态融合, 多模态对齐