Cross-Modal Hashing Network Based on Multimodal Attention Mechanism

doi:10.3778/j.issn.1002-8331.2103-0358

Abstract

Abstract: Deep cross-modal hashing (DCMH) combines the low storage cost and fast retrieval speed of hashing with the strong feature-extraction ability of deep neural networks, and has therefore attracted increasing attention. It effectively integrates modality feature learning and hash representation learning into an end-to-end framework. However, in the feature extraction stage of existing DCMH methods, approaches based on global representation alignment cannot accurately locate the semantically meaningful parts of images and texts, so retrieval accuracy cannot be guaranteed even though retrieval speed is maintained. To address this problem, a cross-modal hashing network based on a multimodal attention mechanism (HX_MAN) is proposed, which introduces attention into DCMH to extract the key information of each modality. First, deep learning is used to extract global context features of the image and text modalities. Second, a multimodal interaction gate is designed to enable fine-grained interaction between image and text. Third, a multimodal attention mechanism is introduced to capture local feature information within each modality more precisely. Finally, the attended features are fed into the hashing module to obtain binary hash codes. At retrieval time, data of either modality is fed into the trained module to obtain its hash code, the Hamming distance between this code and the hash codes in the retrieval database is computed, and results of the other modality are returned in order of Hamming distance. Experimental results show that HX_MAN achieves better retrieval performance than existing DCMH methods: while maintaining retrieval speed, it extracts local fine-grained features of images and texts more accurately and improves retrieval accuracy.

Key words: cross-modal retrieval, attention mechanism, deep hashing, multimodal learning
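
The retrieval step described in the abstract (obtain a hash code for the query, compute its Hamming distance to every code in the retrieval database, and return the results in order of distance) can be illustrated with a minimal sketch. This is not the HX_MAN implementation: the function names `binarize`, `hamming_distance`, and `rank_by_hamming` are hypothetical, the sign-threshold binarization and ascending sort order are assumptions, and the toy feature values stand in for the output of a trained hashing module.

```python
from typing import List, Sequence


def binarize(features: Sequence[float]) -> int:
    """Pack a real-valued feature vector into an integer hash code.

    Assumption: bit i is set to 1 when features[i] >= 0 (sign thresholding).
    """
    code = 0
    for i, value in enumerate(features):
        if value >= 0:
            code |= 1 << i
    return code


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two hash codes differ."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query_code: int, database_codes: Sequence[int]) -> List[int]:
    """Return database indices sorted by ascending Hamming distance to the query."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )


# Toy values (made up for illustration): a text query against three image codes.
text_query = binarize([0.9, -0.2, 0.4, -0.7])
image_database = [
    binarize([0.8, -0.1, 0.3, -0.5]),   # same sign pattern as the query
    binarize([-0.6, 0.7, -0.2, 0.9]),   # opposite sign pattern
    binarize([0.5, 0.2, 0.1, -0.3]),    # differs from the query in one bit
]
ranking = rank_by_hamming(text_query, image_database)
print(ranking)
```

Because the codes are compared with XOR and a bit count, the cost of each comparison depends only on the code length, which is the source of the low storage cost and fast retrieval that the abstract attributes to hashing.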

WU Jixiang, LU Qin, LI Weixiao. Cross-Modal Hashing Network Based on Multimodal Attention Mechanism[J]. Computer Engineering and Applications (计算机工程与应用), 2022, 58(20): 229-239.
