Computer Engineering and Applications, 2024, Vol. 60, Issue (4): 1-20. DOI: 10.3778/j.issn.1002-8331.2306-0382

• Research Hotspots and Reviews •

Survey on Video-Text Cross-Modal Retrieval

CHEN Lei, XI Yimeng, LIU Libo   

  1. School of Information Engineering, Ningxia University, Yinchuan 750021, China
  • Online: 2024-02-15 Published: 2024-02-15

视频文本跨模态检索研究综述

陈磊,习怡萌,刘立波   

  1. 宁夏大学 信息工程学院,银川 750021

Abstract: A modality is the specific form in which data exist. The rapid growth of data across modalities has drawn wide attention to multimodal learning. As an important branch of this field, cross-modal retrieval has made notable progress, particularly for image-text retrieval. Compared with images, however, videos carry data from more modalities and convey a broader range of information, which better meets users' demand for comprehensive and flexible information retrieval. Video-text cross-modal retrieval has therefore become an active research topic in recent years. To give a thorough picture of video-text cross-modal retrieval and its state-of-the-art developments, this paper reviews and summarizes representative existing methods. First, current deep learning-based unidirectional and bidirectional video-text cross-modal retrieval methods are categorized and analyzed; the classic works in each category are examined in detail, and their strengths and weaknesses are discussed. Next, from an experimental perspective, the benchmark datasets and evaluation metrics for video-text cross-modal retrieval are introduced, and the performance of several typical methods is compared on multiple commonly used benchmark datasets. Finally, the application prospects, open problems, and future research challenges of video-text cross-modal retrieval are discussed.

Key words: multi-modality, cross-modal retrieval, deep learning, feature extraction

摘要: 模态代表着数据特定的存在形式,不同模态数据的快速增长,使得多模态学习受到广泛关注。跨模态检索作为多模态学习的一个重要分支,在图文方面已得到显著发展。然而视频相对于图像而言承载了更多模态的数据,也包含更广泛的信息,能够满足用户对信息检索全面性、灵活性的要求,近年来逐渐成为跨模态检索的研究热点。为全面认识和理解视频文本跨模态检索及其前沿工作,对现有代表性方法进行了梳理和综述。首先归纳分析了当前基于深度学习的单向、双向视频文本跨模态检索方法,对每类方法中的经典工作进行了详细分析并阐述了优缺点。接着从实验的角度给出视频文本跨模态检索的基准数据集和评价指标,并在多个常用基准数据集上比较了一些典型方法的性能。最后讨论了视频文本跨模态检索的应用前景、待解决问题及未来研究挑战。

关键词: 多模态, 跨模态检索, 深度学习, 特征提取