Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (12): 208-216. DOI: 10.3778/j.issn.1002-8331.2203-0326

• Graphics and Image Processing •


Text-to-Image Method Based on Attention Model with Gate Mechanism

CHEN Jize, JIANG Xiaoyan, GAO Yongbin   

  1. School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
  • Online:2023-06-15 Published:2023-06-15



Abstract: To address the problems of traditional text-to-image methods, such as monotonous local textures, blurred edge details, and mismatch with the input text description, a text-to-image method named RAGAN is proposed, based on an attention model with a gate mechanism. To overcome the inability of traditional methods to generate fine-grained images, an attention network with an added gate mechanism filters out the relevant word vectors and combines them with the intermediate hidden vectors to form new hidden vectors; adversarial training between the generator and the discriminator then drives the generator to produce images with richer textures and clearer object edges, improving image quality. To address the mismatch between generated images and the input text description, text reconstruction is used to extract the deep semantic features embedded in the generated images and compare them with the semantic features of the input text, and a reconstruction loss is defined to improve semantic consistency. Compared with the baseline model, the Inception Score and R-precision improve by 9.17% and 8.3% respectively on the CUB dataset, and by 13.67% and 5.56% respectively on the COCO dataset, demonstrating that the proposed model effectively improves the realism and artistry of the generated images while maintaining semantic consistency.
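The abstract does not give the equations of the gated attention step, so the following is only a minimal NumPy sketch of what "filter relevant word vectors with a gate and mix them into the hidden vectors" could look like. All names here (`gated_word_attention`, `Wg`, `bg`) and the exact gating form (a sigmoid gate interpolating between the word-context vector and the original hidden vector, as in GRU-style gating) are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_word_attention(hidden, words, Wg, bg):
    """Hypothetical gated word attention.

    hidden : (N, D) intermediate hidden vectors (e.g. image regions)
    words  : (T, D) word embeddings of the input sentence
    Wg     : (2D, D) gate weights, bg : (D,) gate bias
    Returns (N, D) new hidden vectors.
    """
    scores = hidden @ words.T                 # (N, T) region-word relevance
    attn = softmax(scores, axis=-1)           # attend over words per region
    context = attn @ words                    # (N, D) word-context vectors
    # Sigmoid gate decides, per dimension, how much word context to let in.
    gate = sigmoid(np.concatenate([hidden, context], axis=-1) @ Wg + bg)
    return gate * context + (1.0 - gate) * hidden
```

With the gate near 1 the new hidden vector is dominated by the attended word context; near 0 it keeps the original hidden state, so irrelevant words are effectively filtered out.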

Key words: attention mechanism, convolutional neural network, generative adversarial network, deep learning