Survey on Video-Text Cross-Modal Retrieval

doi:10.3778/j.issn.1002-8331.2306-0382

Abstract

Modalities are the specific forms in which data exist. The rapid growth of data across different modalities has drawn wide attention to multimodal learning. Cross-modal retrieval, an important branch of this field, has made notable progress, particularly for images and text. Compared with images, however, videos carry data from more modalities and convey a broader range of information, which better meets users' demand for comprehensive and flexible information retrieval. Video-text cross-modal retrieval has therefore become an active research topic in recent years. To give a thorough picture of video-text cross-modal retrieval and its state of the art, this paper systematically reviews and summarizes representative existing methods. First, current deep learning-based unidirectional and bidirectional video-text cross-modal retrieval methods are categorized and analyzed, with the classic works in each category examined in detail and their strengths and weaknesses discussed. Next, from an experimental perspective, the benchmark datasets and evaluation metrics for video-text cross-modal retrieval are introduced, and the performance of several typical methods is compared on commonly used benchmark datasets. Finally, the application prospects, open problems, and future research challenges of video-text cross-modal retrieval are discussed.

Key words: multi-modality, cross-modal retrieval, deep learning, feature extraction

CHEN Lei, XI Yimeng, LIU Libo. Survey on Video-Text Cross-Modal Retrieval[J]. Computer Engineering and Applications, 2024, 60(4): 1-20.
