[1] HUANG X K, ZHANG J Y, WANG X Y, et al. Survey of dense video captioning[J]. Computer Engineering and Applications, 2023, 59(12): 28-48.
[2] LI Q, YANG R, XIAO F, et al. Attention-based anomaly detection in multi-view surveillance videos[J]. Knowledge-Based Systems, 2022, 252: 109348.
[3] ZHOU P Y, WANG L, LIU Z, et al. A survey on generative AI and LLM for video generation, understanding, and streaming[J]. arXiv:2404.16038, 2024.
[4] WANG X. Operational situation awareness technology for railway passenger stations based on intelligent video analysis[J]. Railway Transport and Economy, 2024, 46(8): 144-152.
[5] LI M X, XU C, LI X W, et al. Multimodal fusion for video captioning on urban road scene[J]. Application Research of Computers, 2023, 40(2): 607-611.
[6] TAN C L, LIN Z H, HU J F, et al. Hierarchical semantic correspondence networks for video paragraph grounding[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 18973-18982.
[7] VENUGOPALAN S, ROHRBACH M, DONAHUE J, et al. Sequence to sequence: video to text[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 4534-4542.
[8] SHAO Z, HAN J G, DEBATTISTA K, et al. DCMSTRD: end-to-end dense captioning via multi-scale transformer decoding[J]. IEEE Transactions on Multimedia, 2024, 26: 7581-7593.
[9] TANG P J, WANG H L, LI Q Y. Rich visual and language representation with complementary semantics for video captioning[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2019, 15(2): 1-23.
[10] YAMAZAKI K, TRUONG S, VO K, et al. VLCap: vision-language with contrastive learning for coherent video paragraph captioning[C]//Proceedings of the 2022 IEEE International Conference on Image Processing. Piscataway: IEEE, 2022: 3656-3661.
[11] KOPUKLU O, KOSE N, GUNDUZ A, et al. Resource efficient 3D convolutional neural networks[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop. Piscataway: IEEE, 2019: 1910-1919.
[12] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[J]. arXiv:2103.00020, 2021.
[13] YU H N, WANG J, HUANG Z H, et al. Video paragraph captioning using hierarchical recurrent neural networks[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 4584-4593.
[14] ZHOU L W, ZHOU Y B, CORSO J J, et al. End-to-end dense video captioning with masked transformer[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 8739-8748.
[15] DAI Z H, YANG Z L, YANG Y M, et al. Transformer-XL: attentive language models beyond a fixed-length context[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 2978-2988.
[16] LEI J, WANG L W, SHEN Y L, et al. MART: memory-augmented recurrent transformer for coherent video paragraph captioning[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 2603-2614.
[17] GING S, ZOLFAGHARI M, PIRSIAVASH H, et al. COOT: cooperative hierarchical transformer for video-text representation learning[J]. arXiv:2011.00597, 2021.
[18] PATRO B, NAMBOODIRI V P. Differential attention for visual question answering[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7680-7688.
[19] LV P, WANG W T, WANG Y X, et al. SSAGCN: social soft attention graph convolution network for pedestrian trajectory prediction[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(9): 11989-12003.
[20] WAJID M S, TERASHIMA-MARIN H, NAJAFIRAD P, et al. Deep learning and knowledge graph for image/video captioning: a review of datasets, evaluation metrics, and methods[J]. Engineering Reports, 2024, 6(1): e12785.
[21] ZHOU X Y, ARNAB A, BUCH S, et al. Streaming dense video captioning[C]//Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2024: 18243-18252.
[22] XI Z Y, SHI G, LI X F, et al. Knowledge guided entity-aware video captioning and a basketball benchmark[J]. arXiv:2401.13888, 2024.
[23] WANG T, ZHENG H C, YU M J, et al. Event-centric hierarchical representation for dense video captioning[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(5): 1890-1900.
[24] GU X, CHEN G, WANG Y F, et al. Text with knowledge graph augmented transformer for video captioning[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 18941-18951.
[25] PRUDVIRAJ J, REDDY M I, VISHNU C, et al. AAP-MIT: attentive atrous pyramid network and memory incorporated transformer for multisentence video description[J]. IEEE Transactions on Image Processing, 2022, 31: 5559-5569.
[26] SEO P H, NAGRANI A, ARNAB A, et al. End-to-end generative pretraining for multimodal video captioning[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 17938-17947.