Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (4): 39-56. DOI: 10.3778/j.issn.1002-8331.2304-0139
Survey of Vision Transformer in Low-Level Computer Vision
ZHU Kai, LI Li, ZHANG Tong, JIANG Sheng, BIE Yiming
Online: 2024-02-15
Published: 2024-02-15
Abstract: The Transformer is a revolutionary neural network architecture. Originally designed for natural language processing, it has been widely adopted in computer vision owing to its outstanding performance. While the application of Transformers in natural language processing has been studied and documented extensively, surveys targeting low-level vision tasks remain relatively scarce. This paper briefly introduces the principle of the Transformer and analyzes and summarizes several of its variants. On the application side, it focuses on three key areas of low-level vision: image restoration, image enhancement, and image generation. Through a detailed analysis of how different models perform on these tasks, it examines their performance differences on commonly used datasets. Finally, the paper summarizes development trends of Transformers in low-level vision, offers an outlook, and proposes future research directions to further drive innovation and progress in Transformer-based low-level vision. The rapid development of this field will bring more breakthroughs to computer vision and image processing and deliver more powerful and efficient solutions for practical applications.
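The Transformer principle the survey introduces centers on scaled dot-product self-attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, from Vaswani et al.'s "Attention Is All You Need" (2017). The NumPy sketch below is a minimal illustration of that formula applied to a sequence of image-patch embeddings, the way vision Transformers treat patches as tokens; it is not code from the surveyed paper, and all names such as self_attention, num_patches, and d_model are hypothetical.

```python
# Minimal single-head self-attention sketch (illustrative only, not the
# surveyed paper's code). Image patches are treated as a token sequence.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Scaled dot-product self-attention: softmax(QK^T / sqrt(d_k)) V."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise token affinities
    return softmax(scores, axis=-1) @ V  # attention-weighted mix of values

rng = np.random.default_rng(0)
num_patches, d_model = 16, 32            # e.g. a 4x4 grid of patch embeddings
tokens = rng.normal(size=(num_patches, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # (16, 32): one updated embedding per patch
```

Because the attention map is num_patches x num_patches, cost grows quadratically with image size, which is what motivates the windowed and hierarchical variants the survey goes on to analyze.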
ZHU Kai, LI Li, ZHANG Tong, JIANG Sheng, BIE Yiming. Survey of Vision Transformer in Low-Level Computer Vision[J]. Computer Engineering and Applications, 2024, 60(4): 39-56.