Review of Research on Multimodal Retrieval

doi:10.3778/j.issn.1002-8331.2305-0294

Abstract

With the growth of multimodal data, multimodal retrieval has attracted considerable attention. As computing and big data technologies are adopted in the automotive, medical, and other industries, much industry data is inherently multimodal. Rapid industrial development keeps raising the demand for information, and single-modality retrieval can no longer meet it. To address this and to support using data of one modality to retrieve data of other modalities, this paper surveys multimodal retrieval methods through a literature review. It analyzes approaches based on common subspaces, deep learning, and multimodal hashing, and organizes the multimodal retrieval techniques proposed in recent years. Finally, it evaluates and compares these methods in terms of retrieval accuracy, retrieval efficiency, and characteristics, analyzes the challenges facing multimodal retrieval, and discusses its future application prospects.

Key words: multimodal retrieval, common subspace, deep learning, hashing algorithm

JIN Tao, JIN Ran, HOU Tengda, YUAN Jie, GU Xiaozhe. Review of Research on Multimodal Retrieval[J]. Computer Engineering and Applications, 2024, 60(5): 62-75.
