Review of Research on Application of Vision Transformer in Medical Image Analysis

doi:10.3778/j.issn.1002-8331.2206-0022

Abstract

The deep self-attention network (Transformer) is naturally suited to modeling global features and long-range dependencies in its input, which makes it strongly complementary to the inductive biases of convolutional neural networks (CNNs). Inspired by its success in natural language processing, the Transformer has been widely adopted across computer vision tasks, especially medical image analysis, where it has achieved strong performance. This paper first introduces representative work on vision Transformers for natural images. It then organizes and summarizes related work in medical image segmentation, medical image classification, and medical image registration according to lesion type or organ, with a detailed analysis of the implementation ideas behind some representative studies. Finally, it discusses current research and outlines future directions. The aim is to provide a reference for further in-depth research in this field.

Keywords: vision Transformer, medical image segmentation, medical image classification, medical image registration

SHI Lei, JI Qingyu, CHEN Qingwei, ZHAO Hengyi, ZHANG Junxing. Review of Research on Application of Vision Transformer in Medical Image Analysis[J]. Computer Engineering and Applications, 2023, 59(8): 41-55.

石磊, 籍庆余, 陈清威, 赵恒毅, 张俊星. 视觉Transformer在医学图像分析中的应用研究综述[J]. 计算机工程与应用, 2023, 59(8): 41-55.
