Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (10): 30-46. DOI: 10.3778/j.issn.1002-8331.2310-0395

• Research Hotspots and Reviews •

Survey of Vision Transformer in Fine-Grained Image Classification

SUN Lulu, LIU Jianping, WANG Jian, XING Jialu, ZHANG Yue, WANG Chenyang   

  1. College of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China
  2. The Key Laboratory of Images and Graphics Intelligent Processing of State Ethnic Affairs Commission, North Minzu University, Yinchuan 750021, China
  3. Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
  • Online: 2024-05-15 Published: 2024-05-15

Abstract: Fine-grained image classification (FGIC) is a long-standing problem in computer vision. Unlike conventional image classification, FGIC must distinguish between classes whose objects look extremely similar, which makes the task considerably harder. With the development of deep learning, Vision Transformer (ViT) models have gained wide adoption in computer vision and have been applied to FGIC. This paper first describes the challenges of FGIC, then gives an overview of the ViT model and analyzes its characteristics. Organized primarily by model structure, it reviews ViT-based FGIC algorithms in four areas: feature extraction, feature relation modeling, feature attention, and feature enhancement. Each algorithm is summarized, and its advantages and disadvantages are analyzed. The performance of different ViT models is then compared on the same public dataset to validate their effectiveness on FGIC. Finally, the limitations of current research are identified, and future research directions are proposed to further explore the potential of ViT in FGIC.

Key words: fine-grained image classification, Vision Transformer, feature extraction, feature relation modeling, feature attention, feature enhancement
