Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (20): 147-157. DOI: 10.3778/j.issn.1002-8331.2211-0456

• Graphics and Image Processing •

Fine-Grained Image Classification Combining Swin and Multi-Scale Feature Fusion

XIANG Jianwen, CHEN Minrong, YANG Baibing   

  1. School of Computer Science, South China Normal University, Guangzhou 510631, China
  • Online: 2023-10-15    Published: 2023-10-17

Abstract: To address the small inter-class and large intra-class variations characteristic of fine-grained images, this paper proposes a model based on Swin and multi-scale feature fusion (SwinFC). The Swin Transformer, with its multi-stage hierarchical architecture, serves as the backbone visual feature extractor, supplying local and global information as well as multi-scale features. A module fusing external-dependency attention and cross-space attention is then embedded on the branch of each stage to capture both the latent correlations among data samples and the discriminative features along different spatial directions, thereby strengthening the representation at every stage of the network. Further, a feature fusion module performs multi-scale fusion of the features extracted at each stage, encouraging the network to learn more comprehensive, complementary and diverse feature information. Finally, a feature selection module screens the important and discriminative image patches, enlarging inter-class differences and reducing intra-class differences to enhance the model's discriminative power. Experimental results show that the proposed method achieves classification accuracies of 92.5%, 91.8% and 85.84% on the three public fine-grained datasets CUB-200-2011, NABirds and WebFG-496, respectively, outperforming most mainstream methods; compared with the Swin baseline, classification performance improves by 1.4, 2.6 and 4.86 percentage points, respectively.
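To make the pipeline in the abstract concrete, below is a minimal PyTorch sketch of the per-stage attention, multi-scale fusion and patch-selection steps. It is a hedged reconstruction, not the authors' code: the stage features are toy tensors shaped like Swin-B stage outputs for a 224x224 input, and the names CrossSpaceAttention, SwinFCHead, fused_dim and k_patches are hypothetical. The cross-space attention here follows the coordinate-attention idea of pooling along the two spatial directions; the external-dependency branch described in the paper is omitted for brevity.

    # Minimal sketch of the SwinFC-style head, assuming PyTorch.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossSpaceAttention(nn.Module):
        """Reweights features along the two spatial directions (H and W);
        an illustrative stand-in for the paper's fused external-dependency /
        cross-space attention module."""
        def __init__(self, dim):
            super().__init__()
            self.conv_h = nn.Conv2d(dim, dim, kernel_size=1)
            self.conv_w = nn.Conv2d(dim, dim, kernel_size=1)

        def forward(self, x):  # x: (B, C, H, W)
            a_h = torch.sigmoid(self.conv_h(x.mean(dim=3, keepdim=True)))  # (B, C, H, 1)
            a_w = torch.sigmoid(self.conv_w(x.mean(dim=2, keepdim=True)))  # (B, C, 1, W)
            return x * a_h * a_w  # direction-aware reweighting

    class SwinFCHead(nn.Module):
        """Applies per-stage attention, fuses multi-stage features at a common
        resolution, then classifies from the top-k most salient patches.
        Expects a list of Swin stage outputs shaped (B, C_i, H_i, W_i)."""
        def __init__(self, stage_dims, fused_dim, num_classes, k_patches=12):
            super().__init__()
            self.attn = nn.ModuleList(CrossSpaceAttention(c) for c in stage_dims)
            self.proj = nn.ModuleList(nn.Conv2d(c, fused_dim, 1) for c in stage_dims)
            self.score = nn.Linear(fused_dim, 1)  # per-patch importance score
            self.fc = nn.Linear(fused_dim, num_classes)
            self.k = k_patches

        def forward(self, stage_feats):
            h, w = stage_feats[-1].shape[-2:]  # align to the coarsest stage
            fused = sum(
                F.adaptive_avg_pool2d(p(a(f)), (h, w))  # attend, project, resize
                for a, p, f in zip(self.attn, self.proj, stage_feats)
            )  # multi-scale fusion: (B, fused_dim, h, w)
            tokens = fused.flatten(2).transpose(1, 2)      # (B, h*w, D)
            scores = self.score(tokens).squeeze(-1)        # (B, h*w)
            top = scores.topk(self.k, dim=1).indices       # keep salient patches
            idx = top.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
            selected = tokens.gather(1, idx)               # (B, k, D)
            return self.fc(selected.mean(dim=1))           # logits

    # Toy stage features mimicking Swin-B stage outputs for a 224x224 input.
    feats = [torch.randn(2, c, s, s) for c, s in [(128, 56), (256, 28), (512, 14), (1024, 7)]]
    head = SwinFCHead([128, 256, 512, 1024], fused_dim=256, num_classes=200)
    print(head(feats).shape)  # torch.Size([2, 200])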

Key words: fine-grained image classification, Swin Transformer, attention mechanism, multi-scale feature fusion, feature selection
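
For completeness, a hypothetical wiring of the head above to a real pretrained backbone, assuming a recent timm release (0.9 or later) in which Swin models accept features_only=True and return NHWC stage maps; the permute converts them to the NCHW layout the sketch expects. The model name is a standard timm identifier; everything else follows from the sketch above.

    # Hypothetical usage with a timm Swin backbone (assumes timm >= 0.9).
    import timm
    import torch

    # Set pretrained=True to load ImageNet weights; False keeps this offline.
    backbone = timm.create_model("swin_base_patch4_window7_224",
                                 pretrained=False, features_only=True)
    x = torch.randn(2, 3, 224, 224)
    feats = [f.permute(0, 3, 1, 2) for f in backbone(x)]  # NHWC -> NCHW
    # Expected stage shapes for Swin-B at 224x224:
    # [(2, 128, 56, 56), (2, 256, 28, 28), (2, 512, 14, 14), (2, 1024, 7, 7)]
    head = SwinFCHead([f.shape[1] for f in feats], fused_dim=256, num_classes=200)
    logits = head(feats)  # (2, 200)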