Computer Engineering and Applications, 2022, Vol. 58, Issue (23): 12-23. DOI: 10.3778/j.issn.1002-8331.2205-0160
XU Wenwan (徐文婉), ZHOU Xiaoping (周小平), WANG Jia (王佳)
Online: 2022-12-01
Published: 2022-12-01
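Abstract: Cross-modal retrieval, which uses a query in one modality to retrieve information in other modalities, has become a research hotspot in the era of big data. Researchers rely on two kinds of methods, real-valued representation and binary representation, to narrow the semantic gap between modalities and to compare similarity effectively, yet these methods still suffer from low retrieval efficiency or information loss. How to further improve retrieval efficiency and information utilization is currently the key challenge for cross-modal retrieval research. This paper introduces the state of development of real-valued and binary representation methods in cross-modal retrieval. It analyzes and compares five types of cross-modal retrieval methods across the two representation techniques, organized around modeling technique and similarity comparison: subspace learning, topic statistical model learning, deep learning, traditional hashing, and deep hashing. It summarizes the latest multimodal datasets to provide a useful reference for researchers and engineers. Finally, it analyzes the challenges facing cross-modal retrieval and points out future research directions for the field.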
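As a rough illustration of the two representation families named in the abstract, the sketch below ranks a small gallery once with real-valued vectors (cosine similarity) and once with binary codes (Hamming distance). Everything in it is an assumption made for illustration: the embeddings are made-up numbers presumed to lie in a shared space already, and the sign-threshold `binarize` function is a toy stand-in for a learned hash function, not a method taken from the paper.

```python
import numpy as np

def cosine_rank(query, gallery):
    """Rank gallery rows by cosine similarity to the query (real-valued representation)."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q), kind="stable")

def binarize(x):
    """Threshold at zero to get 0/1 codes (toy stand-in for a learned hash function)."""
    return (x > 0).astype(np.uint8)

def hamming_rank(query_code, gallery_codes):
    """Rank gallery rows by Hamming distance to the query code (binary representation)."""
    dists = np.count_nonzero(gallery_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable")

# Hypothetical embeddings, assumed to be already projected into a common space.
text_query = np.array([0.9, -0.2, 0.4, -0.7])
image_gallery = np.array([
    [0.8, -0.1, 0.5, -0.6],   # close to the query
    [-0.9, 0.3, -0.4, 0.7],   # roughly opposite to the query
    [0.7, 0.6, 0.1, -0.2],    # partially similar
])

real_order = cosine_rank(text_query, image_gallery)
binary_order = hamming_rank(binarize(text_query), binarize(image_gallery))
print(real_order.tolist(), binary_order.tolist())
```

The binary codes need far less storage and can be compared with bit operations, which is the efficiency argument for hashing. The cost is that thresholding discards magnitude information, which is the information-loss concern the abstract raises.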
XU Wenwan, ZHOU Xiaoping, WANG Jia. Overview of Cross-Modal Retrieval Technology (跨模态检索技术研究综述)[J]. Computer Engineering and Applications, 2022, 58(23): 12-23.