Review of Cross-Modal Retrieval

doi:10.3778/j.issn.1002-8331.2205-0064

Abstract

Abstract: In recent years, media data of many types, such as audio, text, images, and video, have grown explosively on the Internet, and data of different types are often used to describe the same event or topic. Cross-modal retrieval (CMR) provides effective methods for finding semantically related results in other modalities for a query given in any modality, so that users can obtain more information about an event or topic; in effect, data of one modality are used to retrieve data of another. As data retrieval demands grow and new technologies develop, single-modal retrieval can no longer meet users' needs, and researchers have proposed many cross-modal retrieval techniques to address this problem. This paper reviews recent research results in the field of cross-modal retrieval, briefly analyzes traditional cross-modal retrieval methods, focuses on the methods proposed in the past five years, and compares their performance. It also summarizes the problems that cross-modal retrieval research currently faces and discusses prospects for future development.

Key words: cross-modal retrieval, subspace learning, deep learning, cross-modal hashing

HOU Tengda, JIN Ran, WANG Yanyi, JIANG Yikai. Review of Cross-Modal Retrieval[J]. Computer Engineering and Applications, 2022, 58(24): 61-72.

侯腾达, 金冉, 王晏祎, 蒋义凯. 跨模态检索研究综述[J]. 计算机工程与应用, 2022, 58(24): 61-72.
