[1] SONG J K, ZENG P P, GU J Y, et al. End-to-end image captioning via visual region aggregation and dual-level collaboration[J]. Journal of Software, 2023, 34(5): 2152-2169.
[2] ZHUO Y Q, WEI J H, LI Z X. Research on image captioning based on double attention model[J]. Acta Electronica Sinica, 2022, 50(5): 1123-1130.
[3] WEI B W, QUAN H Y. Semantic segmentation network based on semantic and morphological feature fusion[J]. Acta Electronica Sinica, 2022, 50(11): 2688-2697.
[4] STEFANINI M, CORNIA M, BARALDI L, et al. From show to tell: a survey on deep learning-based image captioning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(1): 539-559.
[5] LIU H Y, LIN Y J, LIU J H, et al. Hierarchical feature selection from coarse to fine[J]. Acta Electronica Sinica, 2022, 50(11): 2778-2789.
[6] ANEJA J, AGRAWAL H, BATRA D, et al. Sequential latent spaces for modeling the intention during diverse image captioning[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2019: 4261-4270.
[7] DAI B, FIDLER S, URTASUN R, et al. Towards diverse and natural image descriptions via a conditional GAN[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2017: 2970-2979.
[8] SHI J, LI Y, WANG S. Partial off-policy learning: balance accuracy and diversity for human-oriented image captioning[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE, 2021: 2187-2196.
[9] WANG Q Z, WAN J, CHAN A B. On diversity in image captioning: metrics and methods[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(2): 1035-1049.
[10] ZHANG X, SUN X, LUO Y, et al. RSTNet: captioning with adaptive attention on visual and non-visual words[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2021: 15465-15474.
[11] ZHOU L W, PALANGI H, ZHANG L, et al. Unified vision-language pre-training for image captioning and VQA[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 13041-13049.
[12] LI X J, YIN X, LI C Y, et al. Oscar: object-semantics aligned pre-training for vision-language tasks[J]. arXiv preprint arXiv: 2004.06165, 2020.
[13] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[14] YANG X, ZHANG H W, CAI J F. Deconfounded image captioning: a causal retrospect[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(11): 12996-13010.
[15] VIJAYAKUMAR A, COGSWELL M, SELVARAJU R, et al. Diverse beam search for improved description of complex scenes[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2018: 7371-7379.
[16] DESHPANDE A, ANEJA J, WANG L W, et al. Fast, diverse and accurate image captioning guided by part-of-speech[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE, 2019: 10695-10704.
[17] WANG L, SCHWING A, LAZEBNIK S. Diverse and accurate image description using a variational auto-encoder with an additive gaussian encoding space[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 5756-5766.
[18] MAHAJAN S, GUREVYCH I, ROTH S. Latent normalizing flows for many-to-many cross-domain mappings[C]//Proceedings of the International Conference on Learning Representations (ICLR), 2020.
[19] MAHAJAN S, ROTH S. Diverse image captioning with context-object split latent spaces[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 3613-3624.
[20] XU J, LIU B, ZHOU Y, et al. Diverse image captioning via conditional variational autoencoder and dual contrastive learning[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2023, 20(1): 1-16.