
Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (23): 24-37. DOI: 10.3778/j.issn.1002-8331.2503-0014
WEN Shixiong (温世雄), ZHI Min (智敏)
Online: 2025-12-01
Published: 2025-12-01
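Abstract: Fine-grained image classification (FGIC) aims to distinguish subcategories that are visually highly similar and differ only in subtle details. With the rapid development of deep learning, FGIC algorithms have progressed from traditional strongly supervised learning to weakly supervised learning. The Vision Transformer (ViT), built on multi-head self-attention, does not rely on manual annotation and overcomes the limitations of convolutional neural network (CNN) based algorithms in receptive field and global modeling capability, which has made it one of the mainstream approaches to this task. This survey outlines the characteristics and difficulties of FGIC and briefly introduces the basic architecture of ViT and its advantages. It groups ViT-based improved algorithms by feature fusion strategy into three families: hierarchical, multi-local, and multi-granularity feature fusion methods. The improvements in each family are illustrated with diagrams, and the mechanisms of the individual techniques are explained, summarized, and analyzed in detail. The survey also reviews the commonly used public datasets and, based on the limitations of current research, proposes future research directions to further exploit the potential of ViT in fine-grained image classification.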
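The abstract credits ViT's strength to multi-head self-attention, in which every image-patch token attends to every other token. The following is a minimal pure-Python sketch of that generic mechanism, not code from the surveyed paper. The function names (`multi_head_self_attention`, `attention_weights`, and the helpers) are hypothetical, and the projection matrices are random and untrained, so only the shapes and the row-normalised attention weights are meaningful.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_weights(q, k):
    """Row-wise softmax of q k^T / sqrt(head_dim) for one head."""
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    return [softmax([s / scale for s in row]) for row in scores]


def multi_head_self_attention(tokens, num_heads, seed=0):
    """Self-attention with random (untrained) projections, for illustration.

    tokens: list of n patch embeddings, each a list of length dim.
    Returns (outputs, weights): outputs is n x dim, weights[h] is n x n.
    """
    dim = len(tokens[0])
    if dim % num_heads != 0:
        raise ValueError("dim must be divisible by num_heads")
    head_dim = dim // num_heads
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        # Assumption: small Gaussian weights stand in for learned parameters.
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    head_outputs, weights = [], []
    for _ in range(num_heads):
        w_q, w_k, w_v = (rand_matrix(dim, head_dim) for _ in range(3))
        q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
        attn = attention_weights(q, k)  # n x n: each token attends to all tokens
        weights.append(attn)
        head_outputs.append(matmul(attn, v))

    # Concatenate the heads along the feature axis, then mix them with W_o.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(tokens))]
    w_o = rand_matrix(dim, dim)
    return matmul(concat, w_o), weights
```

Because each attention row spans all tokens, a single layer already has a global receptive field, which is the contrast with CNNs that the abstract draws. Weakly supervised FGIC methods commonly reuse such attention weights to pick out discriminative patches instead of relying on annotated parts.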
WEN Shixiong, ZHI Min. Survey of Vision Transformers for Fine-Grained Image Classification (original Chinese title: 视觉Transformer在细粒度图像分类中的应用综述)[J]. Computer Engineering and Applications, 2025, 61(23): 24-37.