Transformer-Based Few-Shot and Fine-Grained Image Classification Method

doi:10.3778/j.issn.1002-8331.2207-0005

Abstract

To address the reliance on a single similarity measure and the weak fine-grained feature extraction in few-shot fine-grained image classification, this paper proposes a Transformer-based few-shot fine-grained image classification method that mitigates the poor classification performance caused by the small number of available samples. First, a new module, the CBG Transformer Block, is constructed with a multi-axis attention module and convolution operators as its basic components, and the module is stacked repeatedly to improve the feature extraction ability of the network. Second, a dual similarity module consisting of a relation network and a cosine network is adopted for similarity measurement, which avoids the similarity bias that a single measure introduces when training data are scarce. Finally, the prediction is obtained by averaging the two similarity scores. Experimental results show that the proposed method achieves classification accuracies of 82.70%, 74.22% and 69.68% on the 5-way 5-shot task on three public fine-grained image datasets, CUB-200-2011, Stanford Cars and Stanford Dogs, respectively. These results indicate that the proposed method performs well on few-shot fine-grained image classification tasks.

Keywords: fine-grained image classification, few-shot learning, multi-axis attention, conv-block-grid (CBG) Transformer Block, dual similarity
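The final fusion step described in the abstract, averaging two similarity scores per class, can be illustrated with a minimal sketch. This is not the authors' implementation. The sketch assumes that each class is represented by the mean of its support embeddings, that the relation score is supplied as a plain number (in the paper it comes from a learned relation network, which is not reproduced here), and that the two scores are averaged without rescaling. All function names and the toy values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_prototype(support_embeddings):
    """Element-wise mean of one class's support embeddings.

    Using a mean prototype is an assumption of this sketch,
    not something stated in the abstract.
    """
    n = len(support_embeddings)
    return [sum(column) / n for column in zip(*support_embeddings)]


def dual_similarity_predict(query, support_by_class, relation_scores):
    """Average a cosine score and a relation score for each class.

    relation_scores stands in for the output of a learned relation
    network; here the values are simply passed in by the caller.
    Returns (index of the best class, list of fused scores).
    """
    fused = []
    for support, relation in zip(support_by_class, relation_scores):
        cosine = cosine_similarity(query, class_prototype(support))
        fused.append((cosine + relation) / 2.0)
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


# Toy 2-way 2-shot episode with 2-dimensional embeddings (made-up values).
support_by_class = [
    [[1.0, 0.0], [1.0, 0.0]],  # class 0
    [[0.0, 1.0], [0.0, 1.0]],  # class 1
]
query = [1.0, 0.0]
relation_scores = [0.6, 0.4]  # hypothetical relation-network outputs

predicted, fused = dual_similarity_predict(query, support_by_class, relation_scores)
print(predicted, fused)
```

In this toy episode the query matches the class 0 prototype exactly, so the cosine score dominates and class 0 is selected. In a 5-way 5-shot task the same function would receive five classes with five support embeddings each.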

LU Yan, WANG Yangping, WANG Wenrun. Transformer-Based Few-Shot and Fine-Grained Image Classification Method[J]. Computer Engineering and Applications, 2023, 59(23): 219-227.
