Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (12): 166-175. DOI: 10.3778/j.issn.1002-8331.2210-0331

• Graphics and Image Processing •

LSTFormer: Lightweight Semantic Segmentation Network Based on Swin Transformer

YANG Cheng, GAO Jianlin, ZHENG Meilin, DING Rong   

  1. College of Big Data and Information Engineering, Guizhou University, Guiyang 550025, China
  • Online: 2023-06-15   Published: 2023-06-15

Abstract: To address the high computational complexity common to existing Transformer-based semantic segmentation networks, a lightweight semantic segmentation network based on Swin Transformer, LSTFormer, is proposed. Firstly, feature maps at multiple scales are extracted by the Swin Transformer backbone. Secondly, a full perception module and an improved cascaded fusion module fuse the feature maps of different scales across layers, narrowing the semantic gap between feature maps at different levels. Then, a single Swin Transformer block is introduced to refine the initial segmentation feature map, and its shifted-window self-attention mechanism improves the network's ability to classify individual pixels. Finally, the Dice loss function and the cross-entropy loss function are combined in the training stage to improve the segmentation performance and convergence speed of the network. Experimental results show that LSTFormer reaches 49.47% and 81.47% mIoU on ADE20K and Cityscapes, respectively. Compared with similar networks such as SETR and Swin-UPerNet, LSTFormer achieves comparable segmentation accuracy with fewer parameters and less computation.
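
Note on the combined training loss: the abstract states that a Dice loss and a cross-entropy loss are added during training, but this page does not give the exact formulation or weighting. The PyTorch sketch below is therefore only an illustrative, assumed implementation; the class name DiceCrossEntropyLoss, the equal loss weights, and the ignore_index value are hypothetical choices, not the authors' code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DiceCrossEntropyLoss(nn.Module):
        # Weighted sum of a soft Dice loss and a pixel-wise cross-entropy loss.
        # Assumed formulation: the paper's exact weights are not specified here.
        def __init__(self, num_classes, ce_weight=1.0, dice_weight=1.0,
                     ignore_index=255, eps=1e-6):
            super().__init__()
            self.num_classes = num_classes
            self.ce_weight = ce_weight
            self.dice_weight = dice_weight
            self.ignore_index = ignore_index
            self.eps = eps

        def forward(self, logits, target):
            # logits: (N, C, H, W) raw class scores; target: (N, H, W) integer labels.
            ce = F.cross_entropy(logits, target, ignore_index=self.ignore_index)

            probs = logits.softmax(dim=1)
            valid = (target != self.ignore_index).unsqueeze(1).float()    # (N, 1, H, W)
            safe_target = target.clamp(min=0, max=self.num_classes - 1)   # keep one_hot in range on ignored pixels
            one_hot = F.one_hot(safe_target, self.num_classes).permute(0, 3, 1, 2).float()

            probs = probs * valid          # exclude ignored pixels from both Dice terms
            one_hot = one_hot * valid

            # soft Dice computed per class over the batch, then averaged
            intersection = (probs * one_hot).sum(dim=(0, 2, 3))
            cardinality = (probs + one_hot).sum(dim=(0, 2, 3))
            dice = 1.0 - (2.0 * intersection + self.eps) / (cardinality + self.eps)

            return self.ce_weight * ce + self.dice_weight * dice.mean()

As a usage example under these assumptions, the loss could be instantiated for ADE20K as criterion = DiceCrossEntropyLoss(num_classes=150) and applied to the network's segmentation logits and the ground-truth label map at each training step.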

Key words: lightweight semantic segmentation, Swin Transformer, cross-layer fusion, self-attention mechanism, loss function
