[1] SONG J K, ZENG P P, GU J Y, et al. End-to-end image captioning via visual region aggregation and dual-level collaboration[J]. Journal of Software, 2023, 34(5): 2152-2169.
[2] ZHUO Y Q, WEI J H, LI Z X. Research on image captioning based on double attention model[J]. Acta Electronica Sinica, 2022, 50(5): 1123-1130.
[3] WEI B W, QUAN H Y. Semantic segmentation network based on semantic and morphological feature fusion[J]. Acta Electronica Sinica, 2022, 50(11): 2688-2697.
[4] STEFANINI M, CORNIA M, BARALDI L, et al. From show to tell: a survey on deep learning-based image captioning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(1): 539-559.
[5] LIU H Y, LIN Y J, LIU J H, et al. Hierarchical feature selection from coarse to fine[J]. Acta Electronica Sinica, 2022, 50(11): 2778-2789.
[6] ANEJA J, AGRAWAL H, BATRA D, et al. Sequential latent spaces for modeling the intention during diverse image captioning[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2019: 4261-4270.
[7] DAI B, FIDLER S, URTASUN R, et al. Towards diverse and natural image descriptions via a conditional GAN[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2017: 2970-2979.
[8] SHI J, LI Y, WANG S. Partial off-policy learning: balance accuracy and diversity for human-oriented image captioning[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 2187-2196.
[9] WANG Q Z, WAN J, CHAN A B. On diversity in image captioning: metrics and methods[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(2): 1035-1049.
[10] ZHANG X, SUN X, LUO Y, et al. RSTNet: captioning with adaptive attention on visual and non-visual words[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 15465-15474.
[11] ZHOU L W, PALANGI H, ZHANG L, et al. Unified vision-language pre-training for image captioning and VQA[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 13041-13049.
[12] LI X J, YIN X, LI C Y, et al. Oscar: object-semantics aligned pre-training for vision-language tasks[J]. arXiv preprint arXiv: 2004.06165, 2020.
[13] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[14] YANG X, ZHANG H W, CAI J F. Deconfounded image captioning: a causal retrospect[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(11): 12996-13010.
[15] VIJAYAKUMAR A, COGSWELL M, SELVARAJU R, et al. Diverse beam search for improved description of complex scenes[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2018: 7371-7379.
[16] DESHPANDE A, ANEJA J, WANG L W, et al. Fast, diverse and accurate image captioning guided by part-of-speech[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2019: 10695-10704.
[17] WANG L, SCHWING A, LAZEBNIK S. Diverse and accurate image description using a variational auto-encoder with an additive gaussian encoding space[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 5756-5766.
[18] MAHAJAN S, GUREVYCH I, ROTH S. Latent normalizing flows for many-to-many cross-domain mappings[C]//Proceedings of the International Conference on Learning Representations (ICLR), 2020.
[19] MAHAJAN S, ROTH S. Diverse image captioning with context-object split latent spaces[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 3613-3624.
[20] XU J, LIU B, ZHOU Y, et al. Diverse image captioning via conditional variational autoencoder and dual contrastive learning[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2023, 20(1): 1-16.