Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (20): 229-239. DOI: 10.3778/j.issn.1002-8331.2103-0358

• Graphics and Image Processing •

Cross-Modal Hashing Network Based on Multimodal Attention Mechanism

WU Jixiang, LU Qin, LI Weixiao   

  1. College of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250000, China
  2. Internal Audit Department, China Mobile Information Technology Co., Ltd., Beijing 100000, China
  • Online: 2022-10-15  Published: 2022-10-15

Abstract: Deep cross-modal hashing (DCMH) combines the low storage cost and fast retrieval speed of hashing with the powerful feature-extraction ability of deep neural networks, and has therefore attracted growing attention. It integrates modality feature learning and hash representation learning into a single end-to-end framework. However, in the feature-extraction stage of existing DCMH methods, approaches based on global representation alignment cannot accurately locate the semantically meaningful parts of images and texts, so retrieval accuracy cannot be guaranteed alongside retrieval speed. To address this problem, this paper proposes a cross-modal hashing network based on a multimodal attention mechanism (HX_MAN), which introduces attention into the DCMH framework to extract the key information of each modality. Deep networks first extract the global context features of the image and text modalities; a multimodal interaction gate is then designed to perform fine-grained interaction between the two modalities, and a multimodal attention mechanism is introduced to capture the local feature information within each modality more precisely; finally, the attended features are fed into a hashing module to obtain binary hash codes. At retrieval time, data of either modality is fed into the trained model to obtain its hash code, the Hamming distance between this code and the codes in the retrieval database is computed, and the results of the other modality are returned in order of increasing Hamming distance. Experimental results show that HX_MAN achieves better retrieval performance than existing DCMH methods: while maintaining retrieval speed, it extracts the local fine-grained features of the image and text modalities more accurately and thus improves retrieval precision.
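
The abstract names two architectural components, the multimodal interaction gate and the per-modality attention, without giving their formulas. The PyTorch sketch below is only a hypothetical minimal reading of those two steps: the sigmoid gate, the residual fusion, the single-layer attention scoring, and all layer shapes are assumptions for illustration, not the paper's published definition.

    # Hypothetical sketch of gated cross-modal interaction plus per-modality
    # soft attention. Forms and shapes are assumed, not taken from the paper.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultimodalInteractionGate(nn.Module):
        """Lets each modality's global feature modulate the other (assumed form)."""
        def __init__(self, dim: int):
            super().__init__()
            self.gate = nn.Linear(2 * dim, dim)

        def forward(self, img_feat, txt_feat):            # each (batch, dim)
            # g in (0, 1) controls how much cross-modal context flows in.
            g = torch.sigmoid(self.gate(torch.cat([img_feat, txt_feat], dim=-1)))
            img_out = img_feat + g * txt_feat             # text-informed image feature
            txt_out = txt_feat + g * img_feat             # image-informed text feature
            return img_out, txt_out

    class ModalAttention(nn.Module):
        """Soft attention over local units (image regions or text words)."""
        def __init__(self, dim: int):
            super().__init__()
            self.score = nn.Linear(dim, 1)

        def forward(self, local_feats):                   # (batch, n, dim)
            weights = F.softmax(self.score(local_feats), dim=1)   # (batch, n, 1)
            return (weights * local_feats).sum(dim=1)     # attended feature (batch, dim)

In this reading, local_feats would be region features from an image encoder or word features from a text encoder, and the attended outputs would be what the abstract calls "features with attention" before the hashing module.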

Key words: cross-modal retrieval, attention mechanism, deep hashing, multi-modal learning
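
For the retrieval procedure the abstract describes (binarize the network output, then rank database items of the other modality by Hamming distance), a small self-contained NumPy sketch follows. The sign-based binarization, the 16-bit code length, and all names here are illustrative assumptions.

    # Hypothetical sketch of hash-based retrieval: binarize features, then
    # rank stored codes by ascending Hamming distance to the query code.
    import numpy as np

    def to_hash_code(features: np.ndarray) -> np.ndarray:
        """Binarize real-valued features into {0, 1} hash codes (assumed sign rule)."""
        return (features > 0).astype(np.uint8)

    def hamming_rank(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
        """Return database indices sorted by ascending Hamming distance."""
        dists = (query_code[None, :] != db_codes).sum(axis=1)
        return np.argsort(dists, kind="stable")

    # Toy usage: a 16-bit query against a 100-item cross-modal database.
    rng = np.random.default_rng(0)
    db = to_hash_code(rng.standard_normal((100, 16)))
    query = to_hash_code(rng.standard_normal(16))
    top5 = hamming_rank(query, db)[:5]   # indices of the 5 nearest items

Because the codes are binary, the Hamming distances can also be computed with XOR and popcount on packed integers, which is what makes hash-based retrieval fast in practice.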