Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (11): 211-218. DOI: 10.3778/j.issn.1002-8331.2003-0331

• Graphics and Image Processing •

Research on Video Summarization Technology Combining Local Reward Mechanism

MEI Feng (梅锋), ZHOU Juanping (周娟平), LU Lu (陆璐)

  1. Zhongshan Branch of Guangdong Broadcast & Video Network Co., Ltd., Zhongshan, Guangdong 528403, China
  2. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
  • Publication date: 2021-06-01 Release date: 2021-05-31

Abstract:

Video summarization aims to shorten a video while preserving its main content, which greatly reduces the time needed to browse videos. A key step in video summarization is evaluating the quality of the generated summary, and most existing methods perform this evaluation over the whole video. However, evaluating over the entire video sequence is computationally expensive, especially for long videos. Moreover, such whole-video evaluation tends to ignore the inherent temporal relationships in video data, so the generated summary lacks a logically coherent storyline. This paper therefore proposes a video summarization network that focuses on local information, called the Attentive Local Reward Summarization Network (ALRSN). Specifically, the model uses a self-attention mechanism to predict frame-level importance scores and then generates the video summary from these scores. To evaluate the generated summary, a local reward function is further designed that jointly accounts for local diversity and local representativeness. The function maps the generated summary back to the original video and evaluates it within a local scope, so that the summary retains the temporal structure of the original video. By obtaining higher reward scores within the local scope, the model is encouraged to produce more diverse and more representative summaries. Comprehensive experiments on two benchmark datasets, SumMe and TvSum, show that ALRSN outperforms existing state-of-the-art methods.

Key words: computer vision, video summarization, attention mechanism, local reward function
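
The abstract describes two components: self-attention scoring of frames, and a reward computed within local windows of the original video. The sketch below illustrates how such a pipeline might look. It is not the authors' implementation: the abstract gives no formulas, so the single-head attention with random untrained weights, the fixed non-overlapping `window`, and the particular diversity and representativeness expressions are all assumptions, and the names `attention_scores` and `local_reward` are hypothetical.

```python
import numpy as np


def attention_scores(features, seed=0):
    """Frame importance in [0, 1] from single-head self-attention.

    Assumption: weights are random and untrained, for illustration only.
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape
    w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    w_out = rng.standard_normal(d) / np.sqrt(d)
    q, k, v = features @ w_q, features @ w_k, features @ w_v
    logits = q @ k.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    context = attn @ v                           # each frame attends to all frames
    return 1.0 / (1.0 + np.exp(-(context @ w_out)))


def local_reward(features, selected, window=4):
    """Mean over windows of (diversity + representativeness).

    Assumption: windows are consecutive, non-overlapping frame ranges, so the
    selected frames are scored against their own neighbourhood in the video.
    Feature rows are assumed to be non-zero.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    selected = np.asarray(selected, dtype=bool)
    rewards = []
    for start in range(0, len(feats), window):
        seg = feats[start:start + window]
        picks = seg[selected[start:start + window]]
        if len(picks) == 0:
            rewards.append(0.0)  # a window with no selected frame earns nothing
            continue
        # Representativeness: every frame in the window should lie close to a pick.
        dists = ((seg[:, None, :] - picks[None, :, :]) ** 2).sum(axis=2)
        rep = np.exp(-dists.min(axis=1).mean())
        # Diversity: mean pairwise cosine dissimilarity among the picks.
        if len(picks) > 1:
            sim = picks @ picks.T
            div = (1.0 - sim)[~np.eye(len(picks), dtype=bool)].mean()
        else:
            div = 0.0
        rewards.append(div + rep)
    return float(np.mean(rewards))


if __name__ == "__main__":
    frames = np.random.default_rng(42).standard_normal((12, 16))  # toy frame features
    scores = attention_scores(frames)
    summary = scores > 0.5  # simple thresholding, also an assumption
    print(local_reward(frames, summary, window=4))
```

Because each window is scored separately, the cost of the pairwise comparisons grows with the window size rather than with the full video length, which is one plausible reading of the efficiency argument made in the abstract.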