Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (12): 28-48. DOI: 10.3778/j.issn.1002-8331.2209-0236
HUANG Xiankai, ZHANG Jiayu, WANG Xinyu, WANG Xiaochuan, LIU Ruijun
Online: 2023-06-15
Published: 2023-06-15
Abstract: Dense video captioning is an important branch of video understanding and an active research direction at the intersection of computer vision and natural language processing. Its main goal is to localize, by content, the events contained in an event-rich video and to describe them in the natural language of everyday human communication. Compared with the traditional video captioning task, which generates a single-sentence description, dense video captioning no longer requires the input video to be trimmed around a single event, and the output is a paragraph describing the multiple events within the video. This survey briefly reviews the basic principles of dense video captioning methods and their open problems, and summarizes the main research difficulties and challenges facing the field. Mainstream dense video captioning methods are grouped into five categories according to the pipeline stage they target: event-proposal-based, encoding-based, decoding-based, methods that incorporate auxiliary models, and methods that address the overall pipeline; the implementation of each category and its strengths and weaknesses are introduced. The datasets and evaluation metrics of the field are summarized, and the results of different methods on these datasets are compared. Finally, future directions for dense video captioning techniques and their applications are briefly discussed.
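To make the "localize, then describe" structure shared by the surveyed methods concrete, the following is a minimal, illustrative Python sketch of a two-stage dense video captioning pipeline. It is not the implementation of any surveyed method; all names here (Event, propose_events, describe_event, the toy stand-ins) are hypothetical placeholders introduced for illustration only.

```python
# Illustrative sketch of the two-stage "propose then describe" pipeline
# common to dense video captioning. All names are hypothetical placeholders,
# not the API of any surveyed method.
from dataclasses import dataclass
from typing import Callable, List, Sequence

Feature = Sequence[float]  # one feature vector per sampled frame

@dataclass
class Event:
    start: int  # index of the first frame of the event
    end: int    # index one past the last frame of the event

def dense_video_captioning(
    frames: List[Feature],
    propose_events: Callable[[List[Feature]], List[Event]],
    describe_event: Callable[[List[Feature]], str],
) -> str:
    """Localize events in an untrimmed video, caption each event,
    and join the sentences into a paragraph ordered by start time."""
    events = sorted(propose_events(frames), key=lambda e: e.start)
    sentences = [describe_event(frames[e.start:e.end]) for e in events]
    return " ".join(sentences)

# Toy stand-ins so the sketch runs end to end: a "proposal module" that
# cuts the video into fixed windows, and a "captioner" that only reports
# the window length.
def toy_proposals(frames: List[Feature], window: int = 4) -> List[Event]:
    return [Event(i, min(i + window, len(frames)))
            for i in range(0, len(frames), window)]

def toy_captioner(clip: List[Feature]) -> str:
    return f"An event spanning {len(clip)} frames occurs."

if __name__ == "__main__":
    video = [[0.0] * 8 for _ in range(10)]  # 10 dummy frame features
    print(dense_video_captioning(video, toy_proposals, toy_captioner))
```

In a real system the toy proposal function would be a temporal event/action proposal network and the toy captioner a sequence decoder; the five method categories discussed in the survey differ mainly in which of these stages, or their coupling, they redesign.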
HUANG Xiankai, ZHANG Jiayu, WANG Xinyu, WANG Xiaochuan, LIU Ruijun. Survey of Dense Video Captioning[J]. Computer Engineering and Applications, 2023, 59(12): 28-48.