Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (1): 1-14. DOI: 10.3778/j.issn.1002-8331.2204-0207
LI Xiang, ZHANG Tao, ZHANG Zhe, WEI Hongyang, QIAN Yurong
Online: 2023-01-01
Published: 2023-01-01
Abstract: The Transformer is a deep neural network built on the self-attention mechanism. In recent years, Transformer-based models have become a prominent research direction in computer vision, and their architectures continue to be refined and extended, for example with local attention mechanisms and pyramid structures. This paper surveys vision models that improve on the Transformer architecture from two perspectives: performance optimization and structural improvement. It also compares the respective strengths and weaknesses of Transformer and CNN architectures and introduces an emerging hybrid CNN+Transformer design. Finally, it summarizes the development of Transformers in computer vision and discusses future prospects.
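As a minimal illustration of the two building blocks the abstract names, the sketch below shows scaled dot-product self-attention and ViT-style patch embedding, in which an image is split into fixed-size patches that become the token sequence the attention operates on. This is assumed PyTorch code written for this summary, not an implementation taken from the surveyed papers.

```python
# Minimal sketch (assumption: PyTorch), illustrating self-attention and patch embedding.
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)   # project each token to query/key/value
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5             # scaling factor 1/sqrt(dim)

    def forward(self, x):                    # x: (batch, tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.proj(attn @ v)           # weighted sum of values per token

class PatchEmbed(nn.Module):
    """ViT-style patch embedding: each 16x16 image patch becomes one token."""
    def __init__(self, patch=16, in_ch=3, dim=192):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, img):                  # img: (batch, 3, H, W)
        x = self.proj(img)                   # (batch, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, dim)

# Usage: a 224x224 image becomes 14x14 = 196 tokens, each attending globally to all others.
tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))   # (1, 196, 192)
out = SelfAttention(192)(tokens)                      # (1, 196, 192)
```

Because every token attends to every other token, the cost grows quadratically with the number of patches; the local-attention and pyramid designs mentioned in the abstract are largely motivated by reducing this cost for high-resolution images.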
LI Xiang, ZHANG Tao, ZHANG Zhe, WEI Hongyang, QIAN Yurong. Survey of Transformer Research in Computer Vision[J]. Computer Engineering and Applications, 2023, 59(1): 1-14.