Multi-Attention Ensemble for Image Retrieval

doi:10.3778/j.issn.1002-8331.2106-0517

Abstract

Second-order attention for image retrieval relates every input feature to all the others, so the features it outputs carry a great deal of redundant information; in addition, the branches of existing ensemble methods cannot all be trained effectively. To address both problems, a multi-attention ensemble method is proposed for image retrieval. The method uses stand-alone self-attention (SASA), which performs well on image classification, to capture the relations between each feature and its neighborhood and thereby produce more powerful features for retrieval. A multi-attention ensemble framework applies SASA in every attention branch to generate an effective branch feature, and the branch features are combined into the final image feature. The framework trains the model jointly with a ranking loss on the final image feature, a divergence loss across all branches, and a classification loss on each branch, so that every branch is trained sufficiently. Experiments on the CUB200-2011 and CARS196 retrieval datasets demonstrate that the proposed method significantly improves retrieval accuracy.

Key words: stand-alone self-attention, attention ensemble, image retrieval
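
The joint training objective described in the abstract can be illustrated with a minimal sketch. The abstract names only the three kinds of loss, so every concrete choice here is an assumption made for illustration: a triplet loss stands in for the ranking loss, a pairwise hinge penalty stands in for the divergence loss, softmax cross-entropy stands in for the classification loss, the branch features are combined by concatenation and L2 normalization, and the names `combine`, `joint_loss`, `w_div`, and `w_cls` are hypothetical. The attention branches themselves are not modeled; the functions take their outputs as plain lists.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def combine(branch_feats):
    """Assumed combination rule: concatenate branch features, then L2-normalize."""
    return l2_normalize([x for feat in branch_feats for x in feat])


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Stand-in ranking loss on the final image feature."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


def divergence_loss(branch_feats, margin=1.0):
    """Stand-in divergence loss: penalize branch pairs closer than `margin`."""
    total, pairs = 0.0, 0
    for i in range(len(branch_feats)):
        for j in range(i + 1, len(branch_feats)):
            total += max(0.0, margin - sq_dist(branch_feats[i], branch_feats[j]))
            pairs += 1
    return total / pairs if pairs else 0.0


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]


def joint_loss(anchor_branches, pos_branches, neg_branches,
               branch_logits, label, w_div=1.0, w_cls=1.0):
    """Ranking loss on the combined feature + divergence + per-branch classification."""
    rank = triplet_loss(combine(anchor_branches),
                        combine(pos_branches),
                        combine(neg_branches))
    div = divergence_loss([l2_normalize(f) for f in anchor_branches])
    cls = sum(cross_entropy(logits, label) for logits in branch_logits)
    return rank + w_div * div + w_cls * cls


# Toy usage: two branches with 2-D features and a 2-class classifier per branch.
anchor = [[1.0, 0.0], [0.0, 1.0]]
negative = [[0.0, 1.0], [1.0, 0.0]]
print(joint_loss(anchor, anchor, negative, [[0.0, 0.0], [0.0, 0.0]], label=0))
```

Because each branch receives its own classification term in addition to the shared ranking term, every branch gets a direct training signal, while the divergence term discourages the branches from collapsing onto the same feature.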

ZENG Aibo, CHEN Youguang. Multi-Attention Ensemble for Image Retrieval[J]. Computer Engineering and Applications, 2022, 58(24): 205-211.
