Semantic Segmentation Method of UAV Image Based on Window Attention Aggregation Swin Transformer

doi:10.3778/j.issn.1002-8331.2305-0251

Abstract

When UAV remote sensing images are used for ground-object classification, small ground objects are not prominent, backgrounds are complex, and ground-object texture information is hard to distinguish. As a result, existing classical semantic segmentation methods struggle to achieve satisfactory classification results. Building on the Swin Transformer, this study proposes a semantic segmentation network called the window attention aggregation Swin Transformer (WAA SwinT). To address these problems, the method aggregates attention across multiple windows to compute attention more precisely, which improves the classification accuracy and quality for small ground objects in UAV remote sensing images. In addition, drawing on the idea of nested connections, a multilevel feature nested-connection decoder is used to improve the network structure, which yields finer segmentation of UAV remote sensing images. To verify the effectiveness of the proposed method for UAV image semantic segmentation, experiments were conducted on two urban UAV remote sensing datasets, UAVid and UDD, and the method was compared with current classical semantic segmentation methods. The experimental results show that the proposed method achieves the best segmentation results among the compared methods on both the UAVid and UDD datasets. The method also substantially improves the accuracy of segmenting small ground objects and fine textures in UAV images.

Key words: UAV image, semantic segmentation, Swin Transformer, window attention aggregation
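The Swin Transformer computes self-attention inside non-overlapping local windows rather than over the whole image. The abstract does not specify how WAA SwinT aggregates attention across windows, so the sketch below is only an illustrative assumption: it computes single-head window self-attention at several window sizes and averages the results, so each position mixes small-window (fine, local) and larger-window (wider context) information. All function names here are hypothetical, and shifted windows, relative position bias, multiple heads, and the output projection of the real Swin Transformer are deliberately omitted.

```python
import numpy as np


def window_partition(x, w):
    """Split an (H, W, C) feature map into non-overlapping w x w windows.

    Returns shape (num_windows, w*w, C). Assumes H and W are divisible by w.
    """
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/w, W/w, w, w, C)
    return x.reshape(-1, w * w, C)


def window_reverse(windows, w, H, W):
    """Inverse of window_partition: (num_windows, w*w, C) -> (H, W, C)."""
    C = windows.shape[-1]
    x = windows.reshape(H // w, W // w, w, w, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/w, w, W/w, w, C)
    return x.reshape(H, W, C)


def softmax(a, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def window_attention(x, w, Wq, Wk, Wv):
    """Single-head self-attention computed independently inside each w x w window."""
    H, W, C = x.shape
    win = window_partition(x, w)                      # (nW, N, C), N = w*w
    q, k, v = win @ Wq, win @ Wk, win @ Wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(C)    # (nW, N, N), scaled dot product
    out = softmax(scores) @ v                         # (nW, N, C)
    return window_reverse(out, w, H, W)


def multi_window_attention_aggregation(x, window_sizes, Wq, Wk, Wv):
    """Hypothetical aggregation: average window-attention outputs over several window sizes."""
    outs = [window_attention(x, w, Wq, Wk, Wv) for w in window_sizes]
    return np.mean(outs, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, W, C = 8, 8, 16
    x = rng.standard_normal((H, W, C))
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    y = multi_window_attention_aggregation(x, (2, 4, 8), Wq, Wk, Wv)
    print(y.shape)  # (8, 8, 16)
```

Averaging is only one plausible way to combine window scales; a learned weighting or concatenation followed by a projection would be equally reasonable choices under the same assumption.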

LI Junjie, YI Shi, HE Runhua, LIU Xi. Semantic Segmentation Method of UAV Image Based on Window Attention Aggregation Swin Transformer[J]. Computer Engineering and Applications, 2024, 60(15): 198-210.
