Survey of Transformer Research in Computer Vision

doi:10.3778/j.issn.1002-8331.2204-0207

Abstract

Abstract: Transformer is a deep neural network based on self-attention mechanism. In recent years, Transformer-based models have become a hot research direction in the field of computer vision, and their structures are constantly being improved and expanded, such as local attention mechanisms, pyramid structures, and so on. Through the improved vision model based on Transformer structure, the performance optimization and structure improvement are reviewed and summarized respectively. In addition，the advantages and disadvantages of the respective structures of the Transformer and convolutional neural network（CNN） are compared and analyzed，and a new hybrid structure of CNN+Transformer is introduced. Finally，the development of Transformer in computer vision is summarized and prospected.

Key words: Transformer, convolutional neural network（CNN）, hybrid structure, computer vision, deep learning

摘要： Transformer是一种基于自注意力机制的深度神经网络。近几年，基于Transformer的模型已成为计算机视觉领域的热门研究方向，其结构也在不断改进和扩展，比如局部注意力机制、金字塔结构等。通过对基于Transformer结构改进的视觉模型，分别从性能优化和结构改进两个方面进行综述和总结；也对比分析了Transformer和CNN各自结构的优缺点，并介绍了一种新型的CNN+Transformer的混合结构；最后，对Transformer在计算机视觉上的发展进行总结和展望。

关键词: Transformer, 卷积神经网络（CNN）, 混合结构, 计算机视觉, 深度学习

LI Xiang, ZHANG Tao, ZHANG Zhe, WEI Hongyang, QIAN Yurong. Survey of Transformer Research in Computer Vision[J]. Computer Engineering and Applications, 2023, 59(1): 1-14.

李翔, 张涛, 张哲, 魏宏杨, 钱育蓉. Transformer在计算机视觉领域的研究综述[J]. 计算机工程与应用, 2023, 59(1): 1-14.

References

[1] VASWANI A，SHAZEER N，PARMAR N，et al.Attention is all you need[C]//Advances in Neural Information Processing Systems，2017：5998-6008.
[2] DEVLIN J，CHANG M W，LEE K，et al.Bert：Pretraining of deep bidirectional transformers for language understanding[J].arXiv：1810.04805，2018.
[3] RADFORD A，NARASIMHAN K，SALIMANS T，et al.Improving language understanding by generative pre-training[C]//Proceedings of the OpenAI，2018.
[4] RADFORD A，WU J，CHILD R，et al.Language models are unsupervised multitask learners[J].OpenAI blog，2019，1（8）：9.
[5] BROWN T，MANN B，RYDER N，et al.Language models are few-shot learners[C]//Advances in Neural Information Processing Systems，2020：1877-1901.
[6] LIU Y，OTT M，GOYAL N，et al.Roberta：a robustly optimized bert pretraining approach[J].arXiv：1907.11692，2019.
[7] RAFFEL C，SHAZEER N，ROBERTS A，et al.Exploring the limits of transfer learning with a unified text-to-text transformer[J].arXiv：1910.10683，2019.
[8] 周志华.机器学习[M].北京：清华大学出版社，2016：6-10.
ZHOU Z H.Machine learning[M].Beijing：Tsinghua University Press，2016：6-10.
[9] LECUN Y，BOSER B，DENKER J S，et al.Backpropaga tion applied to handwritten zip code recognition[J].Neural Computation，1989，1（4）：541-551.
[10] HE K，ZHANG X，REN S，et al.Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition，2016：770-778.
[11] PARMAR N，VASWANI A，USZKOREIT J，et al.Image transformer[C]//Proceedings of the International Conference on Machine Learning，2018：4055-4064.
[12] CARION N，MASSA F，SYNNAEVE G，et al.End-to-end object detection with transformers[C]//Proceedings of the European Conference on Computer Vision，2020：213-229.
[13] DOSOVITSKIY A，BEYER L，KOLESNIKOV A，et al.An image is worth 16x16 words：Transformers for image recognition at scale[J].arXiv：2010.11929，2020.
[14] ZHU X，SU W，LU L，et al.Deformable DETR：Deformable transformers for end-to-end object detection[J].arXiv：2010.04159，2020.
[15] ZHU X，HU H，LIN S，et al.Deformable convnets V2：More deformable，better results[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition，2019：9308-9316.
[16] ZHENG S，LU J，ZHAO H，et al.Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition，2021：6881-6890.
[17] ZENG Y，FU J，CHAO H.Learning joint spatial-temporal transformations for video inpainting[C]//Proceedings of the European Conference on Computer Vision，2020：528-543.
[18] SHAW P，USZKOREIT J，VASWANI A.Self-attention with relative position representations[J].arXiv：1803. 02155，2018.
[19] LIU X，YU H F，DHILLON I，et al.Learning to encode position for transformer with continuous dynamical model[C]//Proceedings of the International Conference on Machine Learning，2020：6327-6335.
[20] ISLAM M A，JIA S，BRUCE N D B.How much position information do convolutional neural networks encode?[J].arXiv：2001.08248，2020.
[21] WANG B，ZHAO D，LIOMA C，et al.Encoding word order in complex embeddings[J].arXiv：1912.12333，2019.
[22] CHU X，TIAN Z，ZHANG B，et al.Conditional positional encodings for vision transformers[J].arXiv：2102.10882，2021.
[23] CHILD R，GRAY S，RADFORD A，et al.Generating long sequences with sparse transformers[J].arXiv：1904. 10509，2019.
[24] LIU Z，HU H，LIN Y，et al.Swin Transformer V2：Scaling up capacity and resolution[J].arXiv：2111.09883，2021.
[25] LIU Z，LIN Y，CAO Y，et al.Swin transformer：Hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2021：10012-10022.
[26] LIN T Y，MAIRE M，BELONGIE S，et al.Microsoft COCO：Common objects in context[C]//Proceedings of the European Conference on Computer Vision，2014：740-755.
[27] DAI J，QI H，XIONG Y，et al.Deformable convolutional networks[C]//Proceedings of the IEEE International Conference on Computer Vision，2017：764-773.
[28] LIN T Y，DOLLáR P，GIRSHICK R，et al.Feature pyramid networks for object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition，2017：2117-2125.
[29] SUN Z，CAO S，YANG Y，et al.Rethinking transformer- based set prediction for object detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2021：3611-3620.
[30] ZHENG M，GAO P，ZHANG R，et al.End-to-end object detection with adaptive clustering transformer[J].arXiv：2011.09315，2020.
[31] BEAL J，KIM E，TZENG E，et al.Toward transformer- based object detection[J].arXiv：2012.09958，2020.
[32] LI H，SUI M，ZHAO F，et al.MVT：Mask vision transformer for facial expression recognition in the wild[J].arXiv：2106.04520，2021.
[33] TOUVRON H，CORD M，DOUZE M，et al.Training data-efficient image transformers & distillation through attention[C]//Proceedings of the International Conference on Machine Learning，2021：10347-10357.
[34] WANG W，XIE E，LI X，et al.Pyramid vision transformer：A versatile backbone for dense prediction without convolu tions[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2021：568-578.
[35] HAN K，XIAO A，WU E，et al.Transformer in transformer[C]//Advances in Neural Information Processing Systems，2021：15908-15919.
[36] CHU X，TIAN Z，WANG Y，et al.Twins：Revisiting the design of spatial attention in vision transformers[C]//Advances in Neural Information Processing Systems，2021：9355-9366.
[37] WANG W，XIE E，LI X，et al.PVT V2：Improved baselines with pyramid vision transformer[J].Computational Visual Media，2022，8（3）：415-424.
[38] WU H，XIAO B，CODELLA N，et al.CVT：Introducing convolutions to vision transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2021：22-31.
[39] XU W，XU Y，CHANG T，et al.Co-scale conv-attentional image transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2021：9981-9990.
[40] YUAN L，CHEN Y，WANG T，et al.Tokens-to-token VIT：Training vision transformers from scratch on imagenet[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2021：558-567.
[41] HEO B，YUN S，HAN D，et al.Rethinking spatial dimensions of vision transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2021：11936-11945.
[42] YUAN K，GUO S，LIU Z，et al.Incorporating convolution designs into visual transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2021：579-588.
[43] CHEN C F，PANDA R，FAN Q.Regionvit：Regional-to-local attention for vision transformers[J].arXiv：2106. 02689，2021.
[44] YANG J，LI C，ZHANG P，et al.Focal self-attention for local-global interactions in vision transformers[J].arXiv：2107.00641，2021.
[45] DONG X，BAO J，CHEN D，et al.CSwin transformer：A general vision transformer backbone with cross-shaped windows[J].arXiv：2107.00652，2021.
[46] LIU Y，SUN G，QIU Y，et al.Transformer in convolutional neural networks[J].arXiv：2106.03180，2021.
[47] WU S，WU T，TAN H，et al.Pale transformer：A general vision transformer backbone with pale-shaped attention[J].arXiv：2112.14000，2021.
[48] HE K，CHEN X，XIE S，et al.Masked autoencoders are scalable vision learners[J].arXiv：2111.06377，2021.
[49] STRUDEL R，GARCIA R，LAPTEV I，et al.Segmenter：Transformer for semantic segmentation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2021：7262-7272.
[50] XIE E，WANG W，YU Z，et al.SegFormer：Simple and efficient design for semantic segmentation with transfor mers[C]//Advances in Neural Information Processing Systems，2021：12077-12090.
[51] ZHANG H，CISSE M，DAUPHIN Y N，et al.Mixup：Beyond empirical risk minimization[J].arXiv：1710. 09412，2017.
[52] CHEN J N，SUN S，HE J，et al.TransMix：Attend to Mix for vision transformers[J].arXiv：2111.09833，2021.
[53] JIANG Z H，HOU Q，YUAN L，et al.All tokens matter：Token labeling for training better vision transformers[C]//Advances in Neural Information Processing Systems，2021：18590-18602.
[54] CHEN X，XIE S，HE K.An empirical study of training self-supervised vision transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2021：9640-9649.
[55] WANG W，YAO L，CHEN L，et al.Crossformer：A versatile vision transformer based on cross-scale attention[J].arXiv：2108.00154，2021.
[56] RYOO M S，PIERGIOVANNI A J，ARNAB A，et al.TokenLearner：What can 8 learned tokens do for images and videos?[J].arXiv：2106.11297，2021.
[57] CHEN B，LI P，LI B，et al.PSVIT：Better vision transformer via token pooling and attention sharing[J].arXiv：2108.03428，2021.
[58] SHAZEER N，LAN Z，CHENG Y，et al.Talking-heads attention[J].arXiv：2003.02436，2020.
[59] ZHOU D，SHI Y，KANG B，et al.Refiner：Refining self-attention for vision transformers[J].arXiv：2106.03714，2021.
[60] ZHOU D，KANG B，JIN X，et al.Deepvit：Towards deeper vision transformer[J].arXiv：2103.11886，2021.
[61] TOUVRON H，CORD M，SABLAYROLLES A，et al.Going deeper with image transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2021：32-42.
[62] ALI A，TOUVRON H，CARON M，et al.XCIT：Cross-covariance image transformers[C]//Advances in Neural Information Processing Systems，2021：20014-20027.
[63] ZHANG P，DAI X，YANG J，et al.Multi-scale vision longformer：A new vision transformer for high-resolution image encoding[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2021：2998-3008.
[64] BELTAGY I，PETERS M E，COHAN A.Longformer：The long-document transformer[J].arXiv：2004.05150，2020.
[65] WANG P，WANG X，WANG F，et al.KVT：KNN attention for boosting vision transformers[J].arXiv：2106.00515，2021.
[66] XIA Z，PAN X，SONG S，et al.Vision transformer with deformable attention[J].arXiv：2201.00520，2022.
[67] HAN Q，FAN Z，DAI Q，et al.Demystifying local vision transformer：Sparse connectivity，weight sharing，and dynamic weight[J].arXiv：2106.04263，2021.
[68] LIN H，CHENG X，WU X，et al.CAT：Cross attention in vision transformer[J].arXiv：2106.05786，2021.
[69] WU K，PENG H，CHEN M，et al.Rethinking and improving relative position encoding for vision transformer[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2021：10033-10041.
[70] TANG Y，HAN K，XU C，et al.Augmented shortcuts for vision transformers[C]//Advances in Neural Information Processing Systems，2021：15316-15327.
[71] TOLSTIKHIN I O，HOULSBY N，KOLESNIKOV A，et al.Mlp-mixer：An all-mlp architecture for vision[C]//Advances in Neural Information Processing Systems，2021：24261-24272.
[72] TOUVRON H，BOJANOWSKI P，CARON M，et al.Resmlp：Feedforward networks for image classification with data-efficient training[J].arXiv：2105.03404，2021.
[73] MELAS-KYRIAZI L.Do you even need attention? a stack of feed-forward layers does surprisingly well on imagenet[J].arXiv：2105.02723，2021.
[74] CHEN J，LU Y，YU Q，et al.Transunet：Transformers make strong encoders for medical image segmentation[J].arXiv：2102.04306，2021.
[75] XIAO T，SINGH M，MINTUN E，et al.Early convolutions help transformers see better[C]//Advances in Neural Information Processing Systems，2021：30392-30400.
[76] LI K，WANG Y，ZHANG J，et al.UniFormer：Unifying convolution and self-attention for visual recognition[J].arXiv：2201.09450，2022.
[77] GRAHAM B，EL-NOUBY A，TOUVRON H，et al.LeViT：A vision transformer in ConvNet’s clothing for faster inference[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2021：12259-12269.
[78] WANG X，GIRSHICK R，GUPTA A，et al.Non-local neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition，2018：7794-7803.
[79] D’ASCOLI S，TOUVRON H，LEAVITT M L，et al.Convit：Improving vision transformers with soft convolutional inductive biases[C]//Proceedings of the International Conference on Machine Learning，2021：2286-2296.
[80] PENG Z，HUANG W，GU S，et al.Conformer：Local features coupling global representations for visual recognition[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2021：367-376.
[81] CHEN Y，DAI X，CHEN D，et al.Mobile-former：Bridging mobilenet and transformer[J].arXiv：2108.05895，2021.
[82] HUANG Z，WANG X，HUANG L，et al.CCnet：Criss- cross attention for semantic segmentation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2019：603-612.
[83] SRINIVAS A，LIN T Y，PARMAR N，et al.Bottleneck transformers for visual recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition，2021：16519-16529.
[84] HU H，ZHANG Z，XIE Z，et al.Local relation networks for image recognition[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision，2019：3464-3473.
[85] HU J，SHEN L，SUN G.Squeeze-and-excitation networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition，2018：7132-7141.
[86] DE BRABANDERE B，JIA X，TUYTELAARS T，et al.Dynamic filter networks[J].arXiv：1605.09673，2016.
[87] HU J，SHEN L，ALBANIE S，et al.Gather-excite：Exploiting feature context in convolutional neural networks[J].arXiv：1810.12348，2018.
[88] YU C，XIAO B，GAO C，et al.Lite-HRNet：A lightweight high-resolution network[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition，2021：10440-10450.
[89] LI D，HU J，WANG C，et al.Involution：Inverting the inherence of convolution for visual recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition，2021：12321-12330.
[90] TROCKMAN A，KOLTER J Z.Patches are all you need?[J].arXiv：2201.09792，2022.
[91] KHAN S，NASEER M，HAYAT M，et al.Transformers in vision：A survey[J].ACM Computing Surveys（CSUR），2022，54（10）：1-41.
[92] HAN K，WANG Y，CHEN H，et al.A survey on visual transformer[J].arXiv：2012.12556，2020.
[93] 田永林，王雨桐，王建功，等.视觉Transformer研究的关键问题：现状及展望[J].自动化学报，2022，48（4）：957-979.
TIAN Y L，WANG Y T，WANG J G，et al.Key problems and progress of vision Transformers：The state of the art and prospects[J].Acta Automatica Sinica，2022，48（4）：957-979.
[94] 刘文婷，卢新明.基于计算机视觉的Transformer研究进展[J].计算机工程与应用，2022，58（6）：1-16.
LIU W T，LU X M.Research progress of Transformer based on computer vision[J].Computer Engineering and Applications，2022，58（6）：1-16.