Computer Engineering and Applications, 2020, Vol. 56, Issue (21): 1-10. DOI: 10.3778/j.issn.1002-8331.2002-0342


Survey of Multimodal Deep Learning

SUN Yingying, JIA Zhentang, ZHU Haoyu   

  1. College of Electronics and Information Engineering, Shanghai University of Electric Power, Shanghai 200090, China
  • Online: 2020-11-01 Published: 2020-11-03





A modality is a channel through which people receive information, such as hearing, vision, smell, or touch. Multimodal learning aims to learn better feature representations by exploiting the complementarity between modalities and eliminating the redundancy between them. Its purpose is to build models that can process and relate information from multiple modalities. It is a dynamic multidisciplinary field of increasing importance and great potential. Current research concentrates on multimodal learning across image, video, audio, and text. This paper reviews applications of multimodal learning at the practical level, including audio-visual speech recognition, image-text emotion analysis, and collaborative annotation, as well as its use at the core level, and explains the core issues of multimodal learning: matching and classification, and alignment representation learning. Finally, it introduces the data sets commonly used in multimodal learning and discusses future development trends.

Key words: multimodal learning, multimodal application, multimodal fusion, shared representation space


