Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (5): 62-75. DOI: 10.3778/j.issn.1002-8331.2305-0294
JIN Tao, JIN Ran, HOU Tengda, YUAN Jie, GU Xiaozhe
Online: 2024-03-01
Published: 2024-03-01
Abstract: The rapid growth of multimodal data has drawn increasing attention to multimodal retrieval. As industries such as automotive and medicine adopt computing and big-data technologies, much of their data is inherently multimodal, and retrieval over a single modality can no longer meet users' growing information needs. To address these problems and support querying data of one modality with data of another, this survey reviews the literature on multimodal retrieval, analyzing approaches based on common subspace learning, deep learning, and multimodal hashing, and organizing the multimodal retrieval techniques proposed in recent years to solve these problems. Finally, recent multimodal retrieval methods are compared in terms of retrieval accuracy, efficiency, and other characteristics; the challenges facing multimodal retrieval are analyzed, and its future application prospects are discussed.
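To make two of the method families named in the abstract more concrete, the sketch below illustrates (1) common subspace learning with canonical correlation analysis (CCA) and (2) a hashing-style variant that binarizes the shared coordinates for fast Hamming-distance search. It is a minimal, hedged illustration on synthetic paired features, assuming scikit-learn's CCA implementation and arbitrary feature dimensions; it is not the method of any specific paper surveyed in this article.

```python
# Minimal sketch of common-subspace and hashing-based cross-modal retrieval.
# Synthetic data only; dimensions and sizes are arbitrary assumptions.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_pairs, d_img, d_txt, d_sub = 500, 128, 64, 16

# Paired image/text features (stand-ins for e.g. CNN and text-encoder features).
latent = rng.normal(size=(n_pairs, d_sub))
X_img = latent @ rng.normal(size=(d_sub, d_img)) + 0.1 * rng.normal(size=(n_pairs, d_img))
X_txt = latent @ rng.normal(size=(d_sub, d_txt)) + 0.1 * rng.normal(size=(n_pairs, d_txt))

# (1) Common subspace: CCA maps both modalities into maximally correlated coordinates.
cca = CCA(n_components=d_sub, max_iter=1000)
Z_img, Z_txt = cca.fit_transform(X_img, X_txt)

def cosine_retrieve(query, gallery, k=5):
    """Rank gallery items by cosine similarity to the query vector."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k]

# Text query -> image gallery, compared inside the shared subspace.
print("CCA neighbours of text #0:", cosine_retrieve(Z_txt[0], Z_img))

# (2) Hashing sketch: binarize the shared coordinates and rank by Hamming
# distance, trading some accuracy for compact codes and faster search.
B_img, B_txt = np.sign(Z_img), np.sign(Z_txt)
hamming = np.count_nonzero(B_img != B_txt[0], axis=1)
print("Hashing neighbours of text #0:", np.argsort(hamming)[:5])
```

The deep-learning methods discussed in the survey replace the linear CCA projection with learned neural encoders, but the retrieval step (nearest-neighbour search in a shared space or Hamming space) follows the same pattern.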
JIN Tao, JIN Ran, HOU Tengda, YUAN Jie, GU Xiaozhe. Review of Research on Multimodal Retrieval[J]. Computer Engineering and Applications, 2024, 60(5): 62-75.