Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (23): 24-37. DOI: 10.3778/j.issn.1002-8331.2503-0014

• Hot Topics and Reviews •

Survey of Vision Transformers for Fine-Grained Image Classification

WEN Shixiong, ZHI Min

  1. College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China
  • Online: 2025-12-01 Published: 2025-12-01

Abstract: Fine-grained image classification (FGIC) aims to distinguish subcategories that are visually highly similar yet differ in subtle ways. With the rapid development of deep learning, FGIC algorithms have gradually shifted from traditional strongly supervised learning to weakly supervised learning. The vision Transformer (ViT), with its multi-head self-attention mechanism, removes the need for manual annotations and overcomes the limitations of convolutional neural network (CNN) based algorithms in receptive field and global modeling capacity, and has become one of the mainstream methods for this task. This paper outlines the characteristics and challenges of FGIC and briefly introduces the basic architecture and advantages of ViT. According to their feature fusion strategies, ViT-based improved algorithms are divided into three categories: hierarchical, multi-local, and multi-granularity feature fusion. The improvements in each category are illustrated with detailed diagrams, and the mechanisms of the methods are explained, summarized, and analyzed. Commonly used public datasets are also reviewed, and future research directions are proposed in light of the limitations of current research, with the aim of further exploring the potential of ViT in FGIC.

Key words: fine-grained image classification (FGIC), vision Transformer (ViT), feature fusion