%0 Journal Article %A SUN Yingying %A JIA Zhentang %A ZHU Haoyu %T Survey of Multimodal Deep Learning %D 2020 %R 10.3778/j.issn.1002-8331.2002-0342 %J Computer Engineering and Applications %P 1-10 %V 56 %N 21 %X

A modality is a channel through which people receive information, such as hearing, vision, smell, and touch. Multimodal learning aims to learn better feature representations by exploiting the complementarity between modalities and eliminating the redundancy among them; its goal is to build models that can process and relate information from multiple modalities. It is a dynamic, multidisciplinary field of growing importance and great potential. At present, the most active research direction is multimodal learning across image, video, audio, and text. This paper surveys applications of multimodality at the practical level, including audio-visual speech recognition, image-text sentiment analysis, and collaborative annotation, and at the core level, explaining the central problems of multimodal learning: matching and classification, alignment, and representation learning. Finally, the common datasets used in multimodal learning are introduced, and future development trends of multimodal learning are discussed.

%U http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2002-0342