WEI Yuqi, LI Ning. Cross-Modal Information Interaction Reasoning Network for Image and Text Retrieval[J]. Computer Engineering and Applications, 2023, 59(16): 115-124.
[1] UPPAL S,BHAGAT S,HAZARIKA D,et al.Multimodal research in vision and language:a review of current and emerging trends[J].Information Fusion,2022,77:149-171.
[2] KAUR P,PANNU H S,MALHI A K.Comparative analysis on cross-modal information retrieval:a review[J].Computer Science Review,2021,39(2):100336.
[3] SOCHER R,LI F F.Connecting modalities:semi-supervised segmentation and annotation of images using unaligned text corpora[C]//Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition,2010.
[4] WANG B,YANG Y,XU X,et al.Adversarial cross-modal retrieval[C]//Proceedings of the 25th ACM International Conference on Multimedia,2017:154-162.
[5] FAGHRI F,FLEET D J,KIROS J R,et al.VSE++:improving visual-semantic embeddings with hard negatives[J].arXiv:1707.05612,2017.
[6] GU J,CAI J,JOTY S,et al.Look,imagine and match:improving textual-visual cross-modal retrieval with generative models[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition,2018.
[7] KOU F,DU J,CUI W,et al.Common semantic representation method based on object attention and adversarial learning for cross-modal data in IoV[J].IEEE Transactions on Vehicular Technology,2019,68(12):11588-11598.
[8] EISENSCHTAT A,WOLF L.Linking image and text with 2-way nets[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition,2017.
[9] WANG S,WANG R,YAO Z,et al.Cross-modal scene graph matching for relationship-aware image-text retrieval[C]//Proceedings of the 2020 IEEE/CVF Winter Conference on Applications of Computer Vision,2020.
[10] LI Z X,LING F,MA H F,et al.Cross-media image-text retrieval with two level similarity[J].Acta Electronica Sinica,2021,49(2):268-274.
[11] SONG Y,SOLEYMANI M.Polysemous visual-semantic embedding for cross-modal retrieval[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition,2019:1979-1988.
[12] SHI L,DU J,CHENG G,et al.Cross-media search method based on complementary attention and generative adversarial network for social networks[J].International Journal of Intelligent Systems,2022,37(8):4393-4416.
[13] MESSINA N,AMATO G,ESULI A,et al.Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders[J].ACM Transactions on Multimedia Computing,Communications and Applications,2021,17(4):1-23.
[14] WANG Z,LIU X,LI H,et al.CAMP:cross-modal adaptive message passing for text-image retrieval[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision,2019.
[15] LIN T Y,MAIRE M,BELONGIE S,et al.Microsoft COCO:common objects in context[C]//Proceedings of the 13th European Conference on Computer Vision,2014.
[16] YOUNG P,LAI A,HODOSH M,et al.From image descriptions to visual denotations:new similarity metrics for semantic inference over event descriptions[J].Transactions of the Association for Computational Linguistics,2014,2:67-78.
[17] CHEN H,DING G,LIU X,et al.IMRAM:iterative matching with recurrent attention memory for cross-modal image-text retrieval[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition,2020.
[18] WANG L,LI Y,LAZEBNIK S.Learning deep structure-preserving image-text embeddings[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition,2016.
[19] ZHANG Y,LU H.Deep cross-modal projection learning for image-text matching[C]//Proceedings of the 15th European Conference on Computer Vision,2018.
[20] LI K,ZHANG Y,LI K,et al.Visual semantic reasoning for image-text matching[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision,2019.
[21] QU L,LIU M,CAO D,et al.Context-aware multi-view summarization network for image-text matching[C]//Proceedings of the 28th ACM International Conference on Multimedia,2020.
[22] AKBARI H,KARAMAN S,BHARGAVA S,et al.Multi-level multimodal common semantic space for image-phrase grounding[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition,2019.
[23] NIU Z,ZHONG G,YU H.A review on the attention mechanism of deep learning[J].Neurocomputing,2021,452:48-62.
[24] YANG Z,HE X,GAO J,et al.Stacked attention networks for image question answering[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition,2016.
[25] XU K,BA J,KIROS R,et al.Show,attend and tell:neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on Machine Learning,2015.
[26] YE L,ROCHAN M,LIU Z,et al.Cross-modal self-attention network for referring image segmentation[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition,2019.
[27] LEE K H,CHEN X,HUA G,et al.Stacked cross attention for image-text matching[C]//Proceedings of the 15th European Conference on Computer Vision,2018.
[28] DIAO H,ZHANG Y,MA L,et al.Similarity reasoning and filtration for image-text matching[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition,2021.
[29] STEFANINI M,CORNIA M,BARALDI L,et al.A novel attention-based aggregation function to combine vision and language[C]//Proceedings of the 2020 25th International Conference on Pattern Recognition,2021.
[30] LIU M,SHI Q,NIE L.Image description generation method based on visual correlation and contextual dual attention[J].Journal of Software,2022,33(9):3210-3222.
[31] ANDERSON P,HE X,BUEHLER C,et al.Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition,2018.
[32] KRISHNA R,ZHU Y,GROTH O,et al.Visual genome:connecting language and vision using crowdsourced dense image annotations[J].International Journal of Computer Vision,2017,123(1):32-73.
[33] KARPATHY A,LI F F.Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition,2015.
[34] KINGMA D P,BA J.Adam:a method for stochastic optimization[C]//Proceedings of the 3rd International Conference on Learning Representations,San Diego,2015.
[35] JI Z,WANG H,HAN J,et al.SMAN:stacked multimodal attention network for cross-modal image-text retrieval[J].IEEE Transactions on Cybernetics,2022,52(2):1086-1097.
[36] JI Z,WANG H,HAN J,et al.Saliency-guided attention network for image-sentence matching[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision,2019.
[37] ZHANG Q,LEI Z,ZHANG Z,et al.Context-aware attention network for image-text retrieval[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition,2020.
[38] JI Z,LIN Z,WANG H,et al.Multi-modal memory enhancement attention network for image-text matching[J].IEEE Access,2020,8:38438-38447.
[39] WEI K,ZHOU Z.Adversarial attentive multi-modal embedding learning for image-text matching[J].IEEE Access,2020,8:96237-96248.
[40] WANG S,CHEN Y,ZHUO J,et al.Joint global and co-attentive representation learning for image-sentence retrieval[C]//Proceedings of the 26th ACM International Conference on Multimedia,2018.