
Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (4): 176-191. DOI: 10.3778/j.issn.1002-8331.2309-0403
Continual Image Captioning with Dynamic Token-Used Fusion Feature

JIN Jiali, YU Lu

Online: 2025-02-15
Published: 2025-02-14
Abstract: Self-attention-based architectures such as the Transformer achieve outstanding performance on image captioning. In most existing methods, however, the model is trained only on a static dataset drawn from a single distribution, whereas real-world data mostly arrive as non-independent, non-identically distributed streams; continual image captioning under this setting is considerably more challenging. Continual learning for the multimodal image captioning task remains under-explored, and methods suited to self-attention-based models are lacking. To address these challenges, this paper proposes a continual image captioning method based on fusion features with dynamic tokens. Inside the Transformer, the features of the different modalities involved in image captioning are fused, and a regularization term is computed on the fused features. A token is defined for each sub-task and switches as the sub-task changes; such a token is called a dynamic token. Compared with a static token that is defined once for the whole training process and shared by all sub-tasks, dynamic tokens better preserve the information and characteristics specific to each sub-task. These dynamic task tokens, together with a task-identity fused-feature attention module, are then used to obtain fused features that carry task-identity information, and each sub-task's token is saved once its training finishes, preserving the model's memory of and expressive capacity for old tasks and reducing catastrophic forgetting. Experimental results on the MS-COCO and Flickr30k datasets show that the proposed method outperforms all baselines on the Transformer architecture. Taking CIDEr as an example, the average CIDEr score after training on all tasks improves by 31.06% over fine-tuning and by 13.94% over the best of the baselines.
JIN Jiali, YU Lu. Continual Image Captioning with Dynamic Token-Used Fusion Feature[J]. Computer Engineering and Applications, 2025, 61(4): 176-191.
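To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch of per-task dynamic tokens combined with a task-identity fused-feature attention step. It is an illustration under assumed names, shapes, and hyperparameters (`DynamicTokenCaptioner`, `add_task`, 2048-dimensional image region features, and so on), not the authors' implementation.

```python
# Sketch of the dynamic-token idea: one learnable token per sub-task,
# frozen (saved) at each task switch, used to query the fused
# image-text features for task-identity information.
import torch
import torch.nn as nn

class DynamicTokenCaptioner(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=3, vocab_size=10000):
        super().__init__()
        self.image_proj = nn.Linear(2048, d_model)   # assumed CNN region features
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Single self-attention stream that fuses both modalities.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # One dynamic token per sub-task, grown as new tasks arrive.
        self.task_tokens = nn.ParameterList()
        self.task_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def add_task(self, d_model=512):
        # Called at each task switch: previous tokens are frozen (saved,
        # retaining what each old task learned), and a fresh dynamic
        # token is allocated for the new sub-task.
        for tok in self.task_tokens:
            tok.requires_grad_(False)
        self.task_tokens.append(nn.Parameter(torch.randn(1, 1, d_model)))

    def forward(self, img_feats, captions, task_id):
        # Fuse image regions (B, R, 2048) and caption tokens (B, T)
        # into one sequence of fused features (B, R+T, d_model).
        fused = self.fusion(torch.cat(
            [self.image_proj(img_feats), self.text_embed(captions)], dim=1))
        # Task-identity fused-feature attention: the current task token
        # queries the fused features, yielding a task-aware representation.
        tok = self.task_tokens[task_id].expand(fused.size(0), -1, -1)
        task_feat, _ = self.task_attn(tok, fused, fused)
        return fused, task_feat
```

In this sketch, `add_task` would be called once per sub-task before its training begins; because old tokens are frozen rather than discarded, any previous `task_id` can be replayed at inference to recover that sub-task's specific behavior, which is the role the abstract assigns to saved dynamic tokens in mitigating catastrophic forgetting.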