Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (20): 147-157. DOI: 10.3778/j.issn.1002-8331.2211-0456

• Graphics and Image Processing •

Fine-Grained Image Classification Combining Swin and Multi-Scale Feature Fusion

XIANG Jianwen, CHEN Minrong, YANG Baibing   

  1. School of Computer Science, South China Normal University, Guangzhou 510631, China
  • Online: 2023-10-15    Published: 2023-10-17

Abstract: To address the small inter-class and large intra-class variations characteristic of fine-grained images, this paper proposes a model based on Swin and multi-scale feature fusion (SwinFC). The Swin Transformer, with its multi-stage hierarchical architecture, serves as the backbone visual feature extractor, supplying local and global information as well as multi-scale features. A module fusing external-dependency attention and cross-space attention is then embedded on the branch of each stage to capture both the latent correlations among data samples and the discriminative features along different spatial directions, thereby strengthening the representation at every stage of the network. Further, a feature fusion module performs multi-scale fusion of the features extracted at each stage, encouraging the network to learn more comprehensive, complementary and diverse feature information. Finally, a feature selection module screens the important and discriminative image patches, enlarging inter-class differences and reducing intra-class differences to enhance the model's discriminative power. Experimental results show that the proposed method achieves classification accuracies of 92.5%, 91.8% and 85.84% on the three public fine-grained datasets CUB-200-2011, NABirds and WebFG-496, respectively, outperforming most mainstream methods; compared with the Swin baseline, classification performance improves by 1.4, 2.6 and 4.86 percentage points, respectively.
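To make the pipeline in the abstract concrete, below is a minimal PyTorch sketch of the per-stage attention, multi-scale fusion and patch-selection steps. It is a hedged reconstruction, not the authors' code: the stage features are toy tensors shaped like Swin-B stage outputs for a 224x224 input, and the names CrossSpaceAttention, SwinFCHead, fused_dim and k_patches are hypothetical. The cross-space attention here follows the coordinate-attention idea of pooling along the two spatial directions; the external-dependency branch described in the paper is omitted for brevity.

    # Minimal sketch of the SwinFC-style head, assuming PyTorch.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossSpaceAttention(nn.Module):
        """Reweights features along the two spatial directions (H and W);
        an illustrative stand-in for the paper's fused external-dependency /
        cross-space attention module."""
        def __init__(self, dim):
            super().__init__()
            self.conv_h = nn.Conv2d(dim, dim, kernel_size=1)
            self.conv_w = nn.Conv2d(dim, dim, kernel_size=1)

        def forward(self, x):  # x: (B, C, H, W)
            a_h = torch.sigmoid(self.conv_h(x.mean(dim=3, keepdim=True)))  # (B, C, H, 1)
            a_w = torch.sigmoid(self.conv_w(x.mean(dim=2, keepdim=True)))  # (B, C, 1, W)
            return x * a_h * a_w  # direction-aware reweighting

    class SwinFCHead(nn.Module):
        """Applies per-stage attention, fuses multi-stage features at a common
        resolution, then classifies from the top-k most salient patches.
        Expects a list of Swin stage outputs shaped (B, C_i, H_i, W_i)."""
        def __init__(self, stage_dims, fused_dim, num_classes, k_patches=12):
            super().__init__()
            self.attn = nn.ModuleList(CrossSpaceAttention(c) for c in stage_dims)
            self.proj = nn.ModuleList(nn.Conv2d(c, fused_dim, 1) for c in stage_dims)
            self.score = nn.Linear(fused_dim, 1)  # per-patch importance score
            self.fc = nn.Linear(fused_dim, num_classes)
            self.k = k_patches

        def forward(self, stage_feats):
            h, w = stage_feats[-1].shape[-2:]  # align to the coarsest stage
            fused = sum(
                F.adaptive_avg_pool2d(p(a(f)), (h, w))  # attend, project, resize
                for a, p, f in zip(self.attn, self.proj, stage_feats)
            )  # multi-scale fusion: (B, fused_dim, h, w)
            tokens = fused.flatten(2).transpose(1, 2)      # (B, h*w, D)
            scores = self.score(tokens).squeeze(-1)        # (B, h*w)
            top = scores.topk(self.k, dim=1).indices       # keep salient patches
            idx = top.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
            selected = tokens.gather(1, idx)               # (B, k, D)
            return self.fc(selected.mean(dim=1))           # logits

    # Toy stage features mimicking Swin-B stage outputs for a 224x224 input.
    feats = [torch.randn(2, c, s, s) for c, s in [(128, 56), (256, 28), (512, 14), (1024, 7)]]
    head = SwinFCHead([128, 256, 512, 1024], fused_dim=256, num_classes=200)
    print(head(feats).shape)  # torch.Size([2, 200])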

Key words: fine-grained image classification, Swin Transformer, attention mechanism, multi-scale feature fusion, feature selection
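
For completeness, a hypothetical wiring of the head above to a real pretrained backbone, assuming a recent timm release (0.9 or later) in which Swin models accept features_only=True and return NHWC stage maps; the permute converts them to the NCHW layout the sketch expects. The model name is a standard timm identifier; everything else follows from the sketch above.

    # Hypothetical usage with a timm Swin backbone (assumes timm >= 0.9).
    import timm
    import torch

    # Set pretrained=True to load ImageNet weights; False keeps this offline.
    backbone = timm.create_model("swin_base_patch4_window7_224",
                                 pretrained=False, features_only=True)
    x = torch.randn(2, 3, 224, 224)
    feats = [f.permute(0, 3, 1, 2) for f in backbone(x)]  # NHWC -> NCHW
    # Expected stage shapes for Swin-B at 224x224:
    # [(2, 128, 56, 56), (2, 256, 28, 28), (2, 512, 14, 14), (2, 1024, 7, 7)]
    head = SwinFCHead([f.shape[1] for f in feats], fused_dim=256, num_classes=200)
    logits = head(feats)  # (2, 200)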