Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (24): 61-72. DOI: 10.3778/j.issn.1002-8331.2205-0064
HOU Tengda, JIN Ran, WANG Yanyi, JIANG Yikai
Online: 2022-12-15
Published: 2022-12-15
Abstract: In recent years, media data of various types, such as audio, text, images, and video, have grown explosively on the Internet, and data of different types are often used to describe the same event or topic. Cross-modal retrieval provides effective methods for searching, given a query in any modality, for semantically related results in other modalities, enabling users to obtain richer information about an event or topic; in effect, data of one modality is retrieved using data of another. As retrieval demands and new technologies develop, single-modality retrieval can no longer satisfy users' needs, and researchers have proposed many cross-modal retrieval techniques to address this problem. This paper surveys recent research achievements in cross-modal retrieval: it briefly analyzes traditional cross-modal retrieval methods, focuses on methods proposed in the past five years and compares their performance, summarizes the problems currently facing cross-modal retrieval research, and offers an outlook on future development.
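The core idea described in the abstract, retrieving data of one modality with a query from another, can be sketched as nearest-neighbour search in a shared embedding space. The sketch below is illustrative only: the random projection matrices `W_txt` and `W_img` are hypothetical stand-ins for mappings that a real method (e.g., CCA or a deep network, as surveyed in the paper) would learn from paired training data.

```python
import numpy as np

def cosine_sim(a, b):
    # Pairwise cosine similarity between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def cross_modal_retrieve(query_feat, W_query, gallery_feats, W_gallery, top_k=2):
    """Project a query from one modality and a gallery from another modality
    into a common space, then rank gallery items by cosine similarity."""
    q = query_feat @ W_query        # e.g. text feature -> shared space
    g = gallery_feats @ W_gallery   # e.g. image features -> shared space
    sims = cosine_sim(q[None, :], g)[0]
    return np.argsort(-sims)[:top_k]  # indices of the most similar items

# Toy data: random features and random (untrained) projections.
rng = np.random.default_rng(0)
W_txt = rng.standard_normal((5, 3))    # text: 5-d -> 3-d shared space
W_img = rng.standard_normal((8, 3))    # image: 8-d -> 3-d shared space
text_query = rng.standard_normal(5)
image_gallery = rng.standard_normal((10, 8))
ranked = cross_modal_retrieve(text_query, W_txt, image_gallery, W_img)
print(ranked)  # indices of the top-2 images for the text query
```

In a trained system the two projections are optimized so that paired image-text features land close together in the shared space; the retrieval step itself stays exactly this simple.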
HOU Tengda, JIN Ran, WANG Yanyi, JIANG Yikai. Review of Cross-Modal Retrieval[J]. Computer Engineering and Applications, 2022, 58(24): 61-72.