Computer Engineering and Applications, 2023, Vol. 59, Issue (12): 28-48. DOI: 10.3778/j.issn.1002-8331.2209-0236
• Research Hotspots and Reviews •
Survey of Dense Video Captioning
HUANG Xiankai, ZHANG Jiayu, WANG Xinyu, WANG Xiaochuan, LIU Ruijun
Online: 2023-06-15
Published: 2023-06-15
HUANG Xiankai, ZHANG Jiayu, WANG Xinyu, WANG Xiaochuan, LIU Ruijun. Survey of Dense Video Captioning[J]. Computer Engineering and Applications, 2023, 59(12): 28-48.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2209-0236
• Related Articles •
[1] HE Jiafeng, CHEN Hongwei, LUO Dehan. Review of Real-Time Semantic Segmentation Algorithms for Deep Learning[J]. Computer Engineering and Applications, 2023, 59(8): 13-27.
[2] WANG Xiaoming, MAO Yushi, XU Bin, WANG Zilei. Content Structure Preserved Image Style Transfer Method[J]. Computer Engineering and Applications, 2023, 59(6): 146-154.
[3] XIAO Zhenjiu, LI Xin. Full-Scale Correlation Filtering Tracking Method Based on Density Peak Clustering[J]. Computer Engineering and Applications, 2023, 59(5): 131-139.
[4] XIAO Lizhong, ZANG Zhongxing, SONG Saisai. Research on Cascaded Labeling Framework for Relation Extraction with Self-Attention[J]. Computer Engineering and Applications, 2023, 59(3): 77-83.
[5] LIN Lingde, LIU Na, WANG Zheng'an. Review of Research on Adapter and Prompt Tuning[J]. Computer Engineering and Applications, 2023, 59(2): 12-21.
[6] JING Li, YAO Ke. Research on Text Classification Based on Knowledge Graph and Multimodal[J]. Computer Engineering and Applications, 2023, 59(2): 102-109.
[7] LIU Zeyi, YU Wenhua, HONG Zhiyong, KE Guanzhou, TAN Rongjie. Chinese Event Extraction Using Question Answering[J]. Computer Engineering and Applications, 2023, 59(2): 153-160.
[8] YANG Dong, TIAN Shengwei, YU Long, ZHOU Tiejun, WANG Bo. Fast Model for Joint Extraction of Entity and Relation[J]. Computer Engineering and Applications, 2023, 59(13): 164-170.
[9] ZHENG Bofei, YUN Jing, LIU Limin, JIAO Lei, YUAN Jingshu. Review of Research on Cross-Lingual Summarization[J]. Computer Engineering and Applications, 2023, 59(13): 49-60.
[10] YANG Feng, DING Zhitong, XING Mengmeng, DING Bo. Review of Object Detection Algorithm Improvement in Deep Learning[J]. Computer Engineering and Applications, 2023, 59(11): 1-15.
[11] DONG Gang, XIE Weicheng, HUANG Xiaolong, QIAO Yitian, MAO Qian. Review of Small Object Detection Algorithms Based on Deep Learning[J]. Computer Engineering and Applications, 2023, 59(11): 16-27.
[12] JIANG Zhongmin, ZHANG Wanyan, WANG Wenju. Research of Deep Learning-Based Computational Spectral Imaging for Single RGB Image[J]. Computer Engineering and Applications, 2023, 59(10): 22-34.
[13] LI Xiang, ZHANG Tao, ZHANG Zhe, WEI Hongyang, QIAN Yurong. Survey of Transformer Research in Computer Vision[J]. Computer Engineering and Applications, 2023, 59(1): 1-14.
[14] FU Miaomiao, DENG Miaolei, ZHANG Dexian. Object Detection Algorithms Based on Deep Learning and Transformer[J]. Computer Engineering and Applications, 2023, 59(1): 37-48.
[15] XU Yinxiang, CHEN Qidong, SUN Jun. Text Adversarial Attack Method Applying Based on Improved Quantum Behaved Particle Swarm Optimization[J]. Computer Engineering and Applications, 2022, 58(9): 175-180.