Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (17): 89-97. DOI: 10.3778/j.issn.1002-8331.2306-0024

• Theory, Research and Development •

Improved Transformer Decoding Algorithm for Dense Video Description

YANG Dawei, PAN Xiaofang, MAO Lin, ZHANG Rubo   

  1. School of Electromechanical Engineering, Dalian Minzu University, Dalian, Liaoning 116605, China
  • Online: 2024-09-01    Published: 2024-08-30

Abstract: When a Transformer is applied to dense video description, historical text features interfere with subsequent text generation, making it difficult to capture dynamic video information and degrading the coherence and accuracy of the generated descriptions. To preserve context consistency while mitigating this historical-text interference, this paper proposes an improved Transformer decoding algorithm for dense video description, called D-Uformer. The algorithm uses a feedforward neural network (FNN) to enhance the representation of historical text features and, through skip connections, constructs two branches: a pruning branch that removes redundant information and a compensation branch that reinforces contextual information. The pruning branch subtracts the enhanced history to reduce the inaccurate descriptions caused by over-focusing on historical text features and to raise the model's attention to the input video features; the compensation branch adds the history back to recover contextual information lost during feature transfer, so that the decoder generates accurate and coherent descriptions of the current video content. Experimental results on the ActivityNet and Charades datasets demonstrate a clear performance gain for D-Uformer: compared with the temporally descriptive probabilistic captioning (TDPC) network, accuracy improves by up to 4.816% and diversity by up to 4.167%, and the generated descriptions both align better with the video content and conform more closely to human language conventions.
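The decoding scheme summarized in the abstract can be sketched in code. The following PyTorch snippet is a minimal, hypothetical illustration inferred from the abstract, not the authors' implementation: the class name DUformerDecoderBlock, all dimensions, and the exact placement of the FNN, the subtractive pruning branch, and the additive compensation branch are assumptions.

import torch
import torch.nn as nn
from typing import Optional

class DUformerDecoderBlock(nn.Module):
    """Illustrative sketch of the D-Uformer decoding idea (not the
    published code): an FNN re-encodes historical text features, a
    pruning branch subtracts redundant history before cross-attending
    to the video features, and a compensation branch adds context back."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # FNN that strengthens the representation of historical text features.
        self.history_fnn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, video: torch.Tensor,
                tgt_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Historical text features from (optionally masked) self-attention.
        hist, _ = self.self_attn(text, text, text, attn_mask=tgt_mask)
        enhanced = self.history_fnn(hist)
        # Pruning branch (assumption): subtract the enhanced history so the
        # decoder over-focuses less on past text and attends more to video.
        pruned = self.norm1(text - enhanced)
        # Cross-attention over the input video features.
        attended, _ = self.cross_attn(pruned, video, video)
        # Compensation branch (assumption): additive skip connection restores
        # contextual information lost during feature transfer.
        return self.norm2(attended + enhanced)

# Shape check under the same assumptions: 2 captions of 10 tokens
# decoded against 32 video tokens, all with d_model = 512.
block = DUformerDecoderBlock()
text = torch.randn(2, 10, 512)
video = torch.randn(2, 32, 512)
out = block(text, video)  # -> torch.Size([2, 10, 512])

The subtract-then-add arrangement mirrors the abstract's two skip-connected branches: the subtraction curbs over-attention to historical text, while the addition preserves context consistency across the decoder.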

Key words: dense video description, Transformer network, decoding, feedforward neural network, skip connection
