Survey of Multimodal Deep Learning

doi:10.3778/j.issn.1002-8331.2002-0342

Computer Engineering and Applications, 2020, Vol. 56, Issue 21: 1-10. DOI: 10.3778/j.issn.1002-8331.2002-0342

Survey of Multimodal Deep Learning

SUN Yingying, JIA Zhentang, ZHU Haoyu

College of Electronics and Information Engineering, Shanghai University of Electric Power, Shanghai 200090, China

Online: 2020-11-01 Published: 2020-11-03

Abstract:

A modality is a channel through which people receive information, such as hearing, vision, smell, or touch. Multimodal learning exploits the complementarity between modalities and removes the redundancy among them in order to learn better feature representations. Its goal is to build models that can process and relate information from multiple modalities. It is a vibrant multidisciplinary field of growing importance and great potential. Currently, the most active research direction is multimodal learning across images, video, audio, and text. This paper focuses on practical applications of multimodal learning, including audio-visual speech recognition, image-text sentiment analysis, and collaborative annotation, as well as on applications at the core level, namely matching and classification and aligned representation learning, and it explains these core problems in turn. Finally, datasets commonly used in multimodal learning are introduced, and future trends in the field are discussed.

Key words: multimodal learning, multimodal application, multimodal fusion, shared representation space

SUN Yingying, JIA Zhentang, ZHU Haoyu. Survey of Multimodal Deep Learning[J]. Computer Engineering and Applications, 2020, 56(21): 1-10.
