Survey of Vision Transformer in Low-Level Computer Vision

doi:10.3778/j.issn.1002-8331.2304-0139

Abstract

The Transformer is a revolutionary neural network architecture originally designed for natural language processing, and its strong performance and versatility have led to widespread adoption in computer vision. Although Transformer applications in natural language processing have been studied and reviewed extensively, surveys dedicated to low-level vision tasks remain relatively scarce. This paper therefore first briefly introduces the principles of the Transformer and analyzes and summarizes several of its variants. It then turns to applications of the Transformer in low-level vision, focusing on three key areas: image restoration, image enhancement, and image generation. Through a detailed analysis of how different models perform on these tasks, it examines their performance differences on commonly used datasets, covering the restoration of damaged images, the improvement of image quality, and the generation of realistic images. Finally, the paper summarizes the development trends of the Transformer in low-level vision and proposes directions for future research to drive further innovation in this area. Rapid progress in this field is expected to bring further breakthroughs to computer vision and image processing and to provide more powerful and efficient solutions for practical applications.

Keywords: Transformer, deep learning, attention mechanism, computer vision, low-level vision tasks

ZHU Kai, LI Li, ZHANG Tong, JIANG Sheng, BIE Yiming. Survey of Vision Transformer in Low-Level Computer Vision[J]. Computer Engineering and Applications, 2024, 60(4): 39-56.
