[1] 毛琳, 高航, 杨大伟, 等. 视频描述中链式语义生成网络[J]. 光学精密工程, 2022, 30(24): 3198-3209.
MAO L, GAO H, YANG D W, et al. Chained semantic generation network for video captioning[J]. Optics and Precision Engineering, 2022, 30(24): 3198-3209.
[2] 汤鹏杰, 王瀚漓. 从视频到语言: 视频标题生成与描述研究综述[J]. 自动化学报, 2022, 48(2): 375-397.
TANG P J, WANG H L. From video to language: survey of video captioning and description[J]. Acta Automatica Sinica, 2022, 48(2): 375-397.
[3] 王林, 白云帆. 基于特征强化与知识补充的视频描述方法[J]. 计算机系统应用, 2023, 32(5): 273-282.
WANG L, BAI Y F. Video description method combining feature reinforcement and knowledge supplementation[J]. Computer Systems & Applications, 2023, 32(5): 273-282.
[4] XIONG Y, DAI B, LIN D. Move forward and tell: a progressive generator of video descriptions[C]//Proceedings of the European Conference on Computer Vision, 2018: 468-483.
[5] PARK J S, ROHRBACH M, DARRELL T, et al. Adversarial inference for multi-sentence video description[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 6598-6608.
[6] KRISHNA R, HATA K, REN F, et al. Dense-captioning events in videos[C]//Proceedings of the IEEE International Conference on Computer Vision, 2017: 706-715.
[7] ESCORCIA V, HEILBRON F C, NIEBLES J C, et al. DAPs: deep action proposals for action understanding[C]//Proceedings of the 14th European Conference on Computer Vision, 2016: 768-784.
[8] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[9] WANG J, JIANG W, MA L, et al. Bidirectional attentive fusion with context gating for dense video captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7190-7198.
[10] MUN J, YANG L, REN Z, et al. Streamlined dense video captioning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 6588-6597.
[11] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 5998-6008.
[12] 郭岚. 基于多特征和Transformer的视频语义理解与描述文本生成研究[D]. 兰州: 兰州理工大学, 2022: 17-20.
GUO L. Research on semantic understanding and description text generation of video based on multi-features and Transformer[D]. Lanzhou: Lanzhou University of Technology, 2022: 17-20.
[13] 王永. 基于Transformer网络和双向解码的视频描述研究方法[D]. 南昌: 江西师范大学, 2021: 2-23.
WANG Y. Video captioning research method based on Transformer network and bidirectional decoding[D]. Nanchang: Jiangxi Normal University, 2021: 2-23.
[14] ZHOU L, ZHOU Y, CORSO J J, et al. End-to-end dense video captioning with masked transformer[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 8739-8748.
[15] LEI J, WANG L, SHEN Y, et al. MART: memory-augmented recurrent transformer for coherent video paragraph captioning[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 2603-2614.
[16] WANG T, ZHANG R, LU Z, et al. End-to-end dense video captioning with parallel decoding[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 6847-6857.
[17] SONG Y, CHEN S, JIN Q. Towards diverse paragraph captioning for untrimmed videos[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 11245-11254.
[18] WANG X, CHEN W, WU J, et al. Video captioning via hierarchical reinforcement learning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 4213-4222.
[19] PAPINENI K, ROUKOS S, WARD T, et al. Bleu: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002: 311-318.
[20] DENKOWSKI M, LAVIE A. Meteor universal: language specific translation evaluation for any target language[C]//Proceedings of the 9th Workshop on Statistical Machine Translation, 2014: 376-380.
[21] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 4566-4575.
[22] SHETTY R, ROHRBACH M, HENDRICKS L A, et al. Speaking the same language: matching machine to human captions by adversarial training[C]//Proceedings of the IEEE International Conference on Computer Vision, 2017: 4135-4144.