[1] WANG D F, HU H F, CHEN D H. Transformer with sparse self-attention mechanism for image captioning[J]. Electronics Letters, 2020, 56(15): 764-766.
[2] LI Z X, LIN L, ZHANG C L, et al. A semi-supervised learning approach based on adaptive weighted fusion for automatic image annotation[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2021, 17(1): 1-23.
[3] HUANG L, WANG W M, CHEN J, et al. Attention on attention for image captioning[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 4634-4643.
[4] WU M R, ZHANG X Y, SUN X S, et al. DIFNet: boosting visual information flow for image captioning[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 18020-18029.
[5] YANG X, ZHANG H W, GAO C Y, et al. Learning to collocate visual-linguistic neural modules for image captioning[J]. International Journal of Computer Vision, 2023, 131(1): 82-100.
[6] CHEN F L, ZHANG D Z, HAN M L, et al. VLP: a survey on vision-language pre-training[J]. Machine Intelligence Research, 2023, 20(1): 38-56.
[7] 杨文瑞, 沈韬, 朱艳, 等. 融合ELMo词嵌入的多模态Transformer的图像描述算法[J]. 计算机工程与应用, 2022, 58(21): 223-231.
YANG W R, SHEN T, ZHU Y, et al. Image caption with ELMo embedding and multimodal transformer[J]. Computer Engineering and Applications, 2022, 58(21): 223-231.
[8] GUO L T, LIU J, ZHU X X, et al. Normalized and geometry-aware self-attention network for image captioning[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10327-10336.
[9] XU S, LI Y J, LIN M B, et al. Q-DETR: an efficient low-bit quantized detection transformer[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 3842-3851.
[10] KAMANGAR Z U, SHAIKH G M, HASSAN S, et al. Image caption generation related to object detection and colour recognition using transformer-decoder[C]//Proceedings of the 2023 4th International Conference on Computing, Mathematics and Engineering Technologies. Piscataway: IEEE, 2023: 1-5.
[11] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[J]. arXiv:1411.4555, 2014.
[12] MARZOUK R, ALABDULKREEM E, NOUR M K, et al. Natural language processing with optimal deep learning-enabled intelligent image captioning system[J]. Computers, Materials & Continua, 2023, 74(2): 4435-4451.
[13] XU K, BA J L, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on Machine Learning. New York: ACM, 2015: 2048-2057.
[14] 刘茂福, 施琦, 聂礼强. 基于视觉关联与上下文双注意力的图像描述生成方法[J]. 软件学报, 2022, 33(9): 3210-3222.
LIU M F, SHI Q, NIE L Q. Image captioning based on visual relevance and context dual attention[J]. Journal of Software, 2022, 33(9): 3210-3222.
[15] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086.
[16] DU S, ZHU H, LIN G F, et al. Object semantic analysis for image captioning[J]. Multimedia Tools and Applications, 2023, 82(28): 43179-43206.
[17] 卓亚琦, 魏家辉, 李志欣. 基于双注意模型的图像描述生成方法研究[J]. 电子学报, 2022, 50(5): 1123-1130.
ZHUO Y Q, WEI J H, LI Z X. Research on image captioning based on double attention model[J]. Acta Electronica Sinica, 2022, 50(5): 1123-1130.
[18] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. arXiv:1706.03762, 2017.
[19] RAUNAK V, MENEZES A, POST M, et al. Do GPTs produce less literal translations?[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2023: 1041-1050.
[20] HEWITT J, THICKSTUN J, MANNING C, et al. Backpack language models[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2023: 9103-9125.
[21] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[J]. arXiv:2010.11929, 2020.
[22] GAO S Y, ZHOU C L, ZHANG J. Generalized relation modeling for transformer tracking[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 18686-18695.
[23] 季瑞瑞, 谢宇辉, 骆丰凯, 等. 改进视觉Transformer的人脸识别方法[J]. 计算机工程与应用, 2023, 59(8): 117-126.
JI R R, XIE Y H, LUO F K, et al. Face recognition method based on improved visual Transformer[J]. Computer Engineering and Applications, 2023, 59(8): 117-126.
[24] LIU X Y, PENG H W, ZHENG N X, et al. EfficientViT: memory efficient vision transformer with cascaded group attention[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 14420-14430.
[25] RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1179-1195.
[26] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2014: 740-755.
[27] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3128-3137.
[28] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2002: 311-318.
[29] LIN C Y. ROUGE: a package for automatic evaluation of summaries[C]//Proceedings of the Workshop on Text Summarization Branches Out. Stroudsburg: ACL, 2004: 74-81.
[30] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 4566-4575.
[31] 唐渔, 何志琴, 周宇辉, 等. 基于Se-ResNet50特征编码器的公共环境图像描述生成[J]. 计算机应用研究, 2023, 40(6): 1864-1869.
TANG Y, HE Z Q, ZHOU Y H, et al. Public environment image caption generation based on Se-ResNet-50 feature encoder[J]. Application Research of Computers, 2023, 40(6): 1864-1869.
[32] HERDADE S, KAPPELER A, BOAKYE K, et al. Image captioning: transforming objects into words[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019: 11137-11147.
[33] ZHANG J, LI K K, WANG Z. Parallel-fusion LSTM with synchronous semantic and visual information for image captioning[J]. Journal of Visual Communication and Image Representation, 2021, 75: 103044.
[34] ZHANG Z J, WU Q, WANG Y, et al. Exploring region relationships implicitly: image captioning with visual relationship attention[J]. Image and Vision Computing, 2021, 109: 104146.
[35] PEI H L, CHEN Q H, WANG J, et al. Visual relational reasoning for image caption[C]//Proceedings of the 2020 International Joint Conference on Neural Networks. Piscataway: IEEE, 2020: 1-8.