Computer Engineering and Applications, 2025, Vol. 61, Issue (21): 182-191. DOI: 10.3778/j.issn.1002-8331.2407-0330

• Pattern Recognition and Artificial Intelligence •

Research on Multi-Modal Video Paragraph Captioning Based on Dual-Transformer Structure

ZHAO Hong, ZHANG Lijun   

  1. School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China
  • Online: 2025-11-01  Published: 2025-10-31

Abstract: To address the insufficient attention that existing video paragraph captioning methods pay to the key events in a video and the lack of coherence across their multi-event descriptions, this paper proposes a multimodal video paragraph captioning model based on a dual-Transformer structure, built on the existing encoder-decoder framework. The model uses Faster-RCNN to extract fine-grained object features from the center frames of the video, and a hybrid attention mechanism combines them with the global visual features to select the most representative fine-grained local visual features, which supplement and enhance the information about the key events in the video and improve the accuracy of the content descriptions. In addition, a memory module and a hybrid attention module are introduced into the Transformer structure, and a dual-Transformer architecture is designed: the internal Transformer models intra-event consistency, while the external Transformer models inter-event consistency by using hybrid attention to compute the states most relevant to the current event. The outputs of the internal and external Transformers are combined to predict the content of each event, improving the coherence of the generated descriptions. Experimental results on the ActivityNet Captions and YouCookII datasets show that the proposed model clearly outperforms existing mainstream video paragraph captioning models on the BLEU-4, METEOR, ROUGE-L, and CIDEr metrics, verifying its effectiveness.

Key words: video paragraph captioning, encoder-decoder framework, fine-grained local visual features, dual-Transformer structure
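
The sketch below is a minimal, illustrative reading of the two ideas summarized in the abstract, not the authors' implementation: a hybrid attention step in which the global clip feature selects the most relevant fine-grained local (object-level) features, and a dual-Transformer decoder in which an internal Transformer models intra-event context while an external Transformer attends over a memory of previous event states to keep the paragraph coherent across events. It assumes PyTorch; the module names (HybridAttention, DualTransformerDecoder), layer sizes, and tensor shapes are placeholders chosen for illustration, and the paper's multimodal inputs and memory-update rule are simplified away.

```python
# Minimal sketch (assumptions noted above), not the paper's code.
import torch
import torch.nn as nn


class HybridAttention(nn.Module):
    """Weights fine-grained local features by their relevance to the global feature."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, global_feat: torch.Tensor, local_feats: torch.Tensor) -> torch.Tensor:
        # global_feat: (B, 1, D) clip-level feature; local_feats: (B, N, D) object features.
        # The global feature queries the local features, so the output is a
        # relevance-weighted summary of the most representative local features.
        selected, _ = self.attn(query=global_feat, key=local_feats, value=local_feats)
        return selected  # (B, 1, D)


class DualTransformerDecoder(nn.Module):
    """Internal Transformer: intra-event consistency; external Transformer: inter-event consistency."""
    def __init__(self, dim: int = 512, num_heads: int = 8, vocab_size: int = 10000):
        super().__init__()
        self.hybrid = HybridAttention(dim, num_heads=4)
        self.internal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True), num_layers=2)
        self.external = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True), num_layers=2)
        self.out = nn.Linear(2 * dim, vocab_size)

    def forward(self, event_tokens, global_feat, local_feats, event_memory):
        # event_tokens: (B, T, D) embedded tokens of the current event segment.
        # event_memory: (B, M, D) states stored from previously decoded events.
        visual = self.hybrid(global_feat, local_feats)                    # (B, 1, D)
        intra = self.internal(torch.cat([visual, event_tokens], dim=1))  # within-event context
        # The external Transformer reads the memory of earlier events together with the
        # current event summary, so the prediction stays coherent across events.
        inter = self.external(torch.cat([event_memory, intra[:, :1]], dim=1))
        fused = torch.cat([intra[:, -1], inter[:, -1]], dim=-1)          # combine both streams
        return self.out(fused)                                           # next-token logits (B, vocab)


# Hypothetical usage with random tensors standing in for real video/text features.
B, T, N, M, D = 2, 12, 20, 3, 512
decoder = DualTransformerDecoder(dim=D)
logits = decoder(torch.randn(B, T, D), torch.randn(B, 1, D),
                 torch.randn(B, N, D), torch.randn(B, M, D))
print(logits.shape)  # torch.Size([2, 10000])
```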