Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (4): 1-20.DOI: 10.3778/j.issn.1002-8331.2306-0382
• Research Hotspots and Reviews •
CHEN Lei, XI Yimeng, LIU Libo
Online: 2024-02-15
Published: 2024-02-15
CHEN Lei, XI Yimeng, LIU Libo. Survey on Video-Text Cross-Modal Retrieval[J]. Computer Engineering and Applications, 2024, 60(4): 1-20.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2306-0382