Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (12): 28-48. DOI: 10.3778/j.issn.1002-8331.2209-0236

• Research Hotspots and Reviews •

Survey of Dense Video Captioning

HUANG Xiankai, ZHANG Jiayu, WANG Xinyu, WANG Xiaochuan, LIU Ruijun   

  1. School of Computer Science and Engineering, Beijing Technology and Business University, Beijing 100048, China
  • Online: 2023-06-15 Published: 2023-06-15

Survey of Research Methods for Dense Video Captioning

黄先开,张佳玉,王馨宇,王晓川,刘瑞军   

  1. School of Computer Science, Beijing Technology and Business University, Beijing 100048

Abstract: Dense video captioning is a branch of video understanding that bridges the computer vision and natural language processing communities. It aims to localize events in a video based on its content and to describe event-rich videos in the natural language people use for everyday communication. Unlike conventional single-sentence video captioning, dense video captioning does not require the input video to be trimmed to a single event, and its output is a paragraph describing the multiple events in the video. First, this paper surveys the basic principles and open problems of dense video captioning methods and summarizes the main difficulties and challenges in the field. Second, it reviews current mainstream methods, grouped by the stage of the pipeline they improve into five categories: event proposal, encoding, decoding, the addition of auxiliary models, and the overall process. Then, it summarizes the benchmark datasets and evaluation methodology in the field and compares the performance of typical methods. Finally, it discusses future directions and prospects for dense video captioning in terms of both techniques and applications.

Key words: dense video captioning, video captioning, video understanding, computer vision, natural language processing

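Abstract: Dense video captioning is an important branch of video understanding and an active research direction at the intersection of computer vision and natural language processing. Its main goal is to localize events by content in videos that contain rich events and to describe them in the natural language people use in everyday communication. Compared with the conventional video captioning task, which generates a single-sentence description, dense video captioning no longer requires the input video to be trimmed to a single event, and its output is a descriptive paragraph covering the multiple events in the video. This paper briefly outlines the basic principles and existing problems of dense video captioning methods and summarizes the main research difficulties and challenges facing the field; divides current mainstream dense video captioning methods, according to the stage of the pipeline they address, into five categories (event-proposal-based, encoding-based, decoding-based, adding other auxiliary models, and overall-process-based) and introduces the implementation, strengths, and weaknesses of each; summarizes the relevant datasets and evaluation methods and compares the results of different methods on those datasets; and briefly discusses future directions for dense video captioning technology and its applications.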

Key words: dense video captioning, video captioning, video understanding, computer vision, natural language processing