Computer Engineering and Applications, 2023, Vol. 59, Issue (12): 28-48. DOI: 10.3778/j.issn.1002-8331.2209-0236
• Research Hotspots and Reviews •
Survey of Dense Video Captioning
HUANG Xiankai, ZHANG Jiayu, WANG Xinyu, WANG Xiaochuan, LIU Ruijun
Online: 2023-06-15
Published: 2023-06-15
HUANG Xiankai, ZHANG Jiayu, WANG Xinyu, WANG Xiaochuan, LIU Ruijun. Survey of Dense Video Captioning[J]. Computer Engineering and Applications, 2023, 59(12): 28-48.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2209-0236
• Related Articles •
[1] HE Jiafeng, CHEN Hongwei, LUO Dehan. Review of Real-Time Semantic Segmentation Algorithms for Deep Learning[J]. Computer Engineering and Applications, 2023, 59(8): 13-27.
[2] WANG Xiaoming, MAO Yushi, XU Bin, WANG Zilei. Content Structure Preserved Image Style Transfer Method[J]. Computer Engineering and Applications, 2023, 59(6): 146-154.
[3] XIAO Zhenjiu, LI Xin. Full-Scale Correlation Filtering Tracking Method Based on Density Peak Clustering[J]. Computer Engineering and Applications, 2023, 59(5): 131-139.
[4] XIAO Lizhong, ZANG Zhongxing, SONG Saisai. Research on Cascaded Labeling Framework for Relation Extraction with Self-Attention[J]. Computer Engineering and Applications, 2023, 59(3): 77-83.
[5] LIN Lingde, LIU Na, WANG Zheng'an. Review of Research on Adapter and Prompt Tuning[J]. Computer Engineering and Applications, 2023, 59(2): 12-21.
[6] JING Li, YAO Ke. Research on Text Classification Based on Knowledge Graph and Multimodal[J]. Computer Engineering and Applications, 2023, 59(2): 102-109.
[7] LIU Zeyi, YU Wenhua, HONG Zhiyong, KE Guanzhou, TAN Rongjie. Chinese Event Extraction Using Question Answering[J]. Computer Engineering and Applications, 2023, 59(2): 153-160.
[8] YANG Dong, TIAN Shengwei, YU Long, ZHOU Tiejun, WANG Bo. Fast Model for Joint Extraction of Entity and Relation[J]. Computer Engineering and Applications, 2023, 59(13): 164-170.
[9] ZHENG Bofei, YUN Jing, LIU Limin, JIAO Lei, YUAN Jingshu. Review of Research on Cross-Lingual Summarization[J]. Computer Engineering and Applications, 2023, 59(13): 49-60.
[10] YANG Feng, DING Zhitong, XING Mengmeng, DING Bo. Review of Object Detection Algorithm Improvement in Deep Learning[J]. Computer Engineering and Applications, 2023, 59(11): 1-15.
[11] DONG Gang, XIE Weicheng, HUANG Xiaolong, QIAO Yitian, MAO Qian. Review of Small Object Detection Algorithms Based on Deep Learning[J]. Computer Engineering and Applications, 2023, 59(11): 16-27.
[12] JIANG Zhongmin, ZHANG Wanyan, WANG Wenju. Research of Deep Learning-Based Computational Spectral Imaging for Single RGB Image[J]. Computer Engineering and Applications, 2023, 59(10): 22-34.
[13] LI Xiang, ZHANG Tao, ZHANG Zhe, WEI Hongyang, QIAN Yurong. Survey of Transformer Research in Computer Vision[J]. Computer Engineering and Applications, 2023, 59(1): 1-14.
[14] FU Miaomiao, DENG Miaolei, ZHANG Dexian. Object Detection Algorithms Based on Deep Learning and Transformer[J]. Computer Engineering and Applications, 2023, 59(1): 37-48.
[15] XU Yinxiang, CHEN Qidong, SUN Jun. Text Adversarial Attack Method Applying Based on Improved Quantum Behaved Particle Swarm Optimization[J]. Computer Engineering and Applications, 2022, 58(9): 175-180.