Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (21): 1-10. DOI: 10.3778/j.issn.1002-8331.2002-0342

Survey of Multimodal Deep Learning

SUN Yingying, JIA Zhentang, ZHU Haoyu   

  1. College of Electronics and Information Engineering, Shanghai University of Electric Power, Shanghai 200090, China
  • Online: 2020-11-01 Published: 2020-11-03

多模态深度学习综述

孙影影,贾振堂,朱昊宇   

  1. 上海电力大学 电子与信息工程学院,上海 200090

Abstract:

Modality refers to the way in which people receive information, such as hearing, vision, smell, and touch. Multimodal learning aims to learn better feature representations by exploiting the complementarity between modalities and eliminating the redundancy between them. Its goal is to build models that can process and relate information from multiple modalities. It is a vibrant multidisciplinary field of growing importance and great potential. Current research focuses mainly on multimodal learning across images, video, audio, and text. This paper reviews applications of multimodal learning at the practical level, including audio-visual speech recognition, image-text sentiment analysis, and collaborative annotation, and at the core level, namely matching and classification, and alignment representation learning. It also explains these two core problems of multimodal learning. Finally, commonly used datasets in multimodal learning are introduced, and future development trends are discussed.

Key words: multimodal learning, multimodal application, multimodal fusion, shared representation space

摘要:

模态是指人接收信息的方式,包括听觉、视觉、嗅觉、触觉等多种方式。多模态学习是指通过利用多模态之间的互补性,剔除模态间的冗余性,从而学习到更好的特征表示。多模态学习的目的是建立能够处理和关联来自多种模式信息的模型,它是一个充满活力的多学科领域,具有日益重要和巨大的潜力。目前比较热门的研究方向是图像、视频、音频、文本之间的多模态学习。着重介绍了多模态在视听语音识别、图文情感分析、协同标注等实际层面的应用,以及在匹配和分类、对齐表示学习等核心层面的应用,并针对多模态学习的核心问题:匹配和分类、对齐表示学习方面给出了说明。对多模态学习中常用的数据集进行了介绍,并展望了未来多模态学习的发展趋势。

关键词: 多模态学习, 多模态应用, 多模态融合, 共享表示空间