CoT-TransUNet: Lightweight Context Transformer Medical Image Segmentation Network

doi:10.3778/j.issn.1002-8331.2205-0046

Abstract

To address the overly small receptive field of convolution in earlier medical image segmentation networks and the feature loss of Transformers, an end-to-end lightweight context Transformer medical image segmentation network (CoT-TransUNet) is proposed. The network consists of three parts: an encoder, a decoder, and skip connections. For an input image, the encoder uses a hybrid CoTNet-Transformer module: CoTNet serves as the feature extractor that generates feature maps, and Transformer blocks then encode these feature maps as input sequences. The decoder upsamples the encoded features through a cascaded upsampler, which chains multiple upsampling blocks, each employing the CARAFE upsampling operator. Skip connections aggregate encoder and decoder features at different resolutions. By adopting CoTNet, which combines global and local context information, in the feature extraction stage, and the CARAFE operator, which has a larger receptive field, in the upsampling stage, CoT-TransUNet produces better input feature maps and performs content-based upsampling while remaining lightweight. In experiments on multi-organ segmentation tasks, CoT-TransUNet achieves better performance than the other networks compared.

Key words: medical image segmentation, context Transformer network, cascaded upsampler, lightweight
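
As an illustration of the content-based upsampling mentioned in the abstract, the following is a minimal pure-Python sketch of kernel reassembly in the spirit of CARAFE. It is not the authors' implementation: the function names, the zero padding at the borders, and the hand-written kernel logits are assumptions made for illustration. In CARAFE the per-location kernels are predicted from the input features by a learned module, which is omitted here.

```python
import math


def softmax(values):
    """Normalise a list of logits so the weights are positive and sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def reassemble_upsample(feature, kernel_logits, scale=2, k=3):
    """Upsample a 2-D map by reassembling k x k source neighbourhoods.

    feature:       H x W nested list of floats.
    kernel_logits: (H*scale) x (W*scale) x (k*k) nested list; one
                   unnormalised kernel per output pixel (hypothetical input,
                   standing in for a learned kernel predictor).
    """
    h, w = len(feature), len(feature[0])
    r = k // 2
    out = [[0.0] * (w * scale) for _ in range(h * scale)]
    for oy in range(h * scale):
        for ox in range(w * scale):
            # Source location that this output pixel maps back to.
            sy, sx = oy // scale, ox // scale
            weights = softmax(kernel_logits[oy][ox])
            acc, idx = 0.0, 0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    y, x = sy + dy, sx + dx
                    # Assumption: positions outside the map contribute zero.
                    if 0 <= y < h and 0 <= x < w:
                        acc += weights[idx] * feature[y][x]
                    idx += 1
            out[oy][ox] = acc
    return out


# Usage: kernels peaked at the centre tap make the operator behave like
# nearest-neighbour upsampling; other kernels blend the neighbourhood.
small = [[1.0, 2.0], [3.0, 4.0]]
peaked = [[[100.0 if i == 4 else -100.0 for i in range(9)]
           for _ in range(4)] for _ in range(4)]
upsampled = reassemble_upsample(small, peaked)
```

Because every output pixel has its own kernel, the weighting can vary with image content, and a larger `k` lets each output value draw on a wider source neighbourhood than fixed bilinear interpolation does.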

YANG He, BAI Zhengyao. CoT-TransUNet: Lightweight Context Transformer Medical Image Segmentation Network[J]. Computer Engineering and Applications, 2023, 59(3): 218-225.
