Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (1): 1-14. DOI: 10.3778/j.issn.1002-8331.2204-0207

• Research Hotspots and Reviews •

Survey of Transformer Research in Computer Vision

LI Xiang, ZHANG Tao, ZHANG Zhe, WEI Hongyang, QIAN Yurong   

  1. College of Software, Xinjiang University, Urumqi 830002, China
  • Online: 2023-01-01 Published: 2023-01-01




Abstract: The Transformer is a deep neural network based on the self-attention mechanism. In recent years, Transformer-based models have become an active research direction in computer vision, and their architectures continue to be improved and extended, for example with local attention mechanisms and pyramid structures. This survey reviews improved Transformer-based vision models from two perspectives: performance optimization and structural improvement. In addition, the structural advantages and disadvantages of the Transformer and the convolutional neural network (CNN) are compared and analyzed, and a new hybrid CNN+Transformer structure is introduced. Finally, the development of the Transformer in computer vision is summarized and future directions are discussed.

Keywords: Transformer, convolutional neural network (CNN), hybrid structure, computer vision, deep learning
