Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (23): 219-228. DOI: 10.3778/j.issn.1002-8331.2308-0089

• Graphics and Image Processing •

Spatially Separable Attention Transformer with Cross-Scale Encoding for Remote Sensing Image Road Extraction

TIAN Qing, ZHANG Yao, ZHANG Zheng, LYU Qixiu   

  1. School of Information, North China University of Technology, Beijing 100144, China
    2.Suzhou Jiuhe Intelligent Technology Co., Ltd., Suzhou, Jiangsu 251100, China
  • Online: 2024-12-01    Published: 2024-11-29

Abstract: Road segmentation in remote sensing images is a research hotspot in the field of remote sensing applications and has attracted wide attention. Because remote sensing images inherently feature complex backgrounds and densely packed objects, constructing global semantic information is crucial for accurately extracting roads from them. This paper therefore builds on the Transformer architecture and proposes Cross-RoadFormer, a remote sensing road extraction model with cross-scale token embedding and spatially separable attention. Specifically, to address the non-uniform scale of roads in images, a cross-scale encoding layer is designed that encodes features of different scales into a single token embedding, which serves as the input to the Transformer and resolves the problem of cross-scale interaction within the Transformer. In addition, a spatially separable attention mechanism is proposed, in which locally-grouped attention captures fine-grained, short-range information and globally-sampled attention captures long-range, global contextual information, reducing the model's computational cost while preserving road extraction accuracy. Experiments on the Massachusetts and DeepGlobe datasets show that the proposed Cross-RoadFormer achieves higher intersection over union (IoU) scores of 68.40% and 58.04%, respectively, demonstrating the superiority of the proposed method.
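
The cross-scale encoding layer is only described at a high level in the abstract; the sketch below illustrates one plausible reading of it, in the style of a CrossFormer-like cross-scale embedding: several convolutions with different kernel sizes but a shared stride sample each patch at multiple scales, and their outputs are concatenated channel-wise into a single token. All module names, kernel sizes, and channel splits here are assumptions for illustration, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class CrossScaleEmbedding(nn.Module):
    """Hypothetical cross-scale token embedding: one conv per scale,
    outputs concatenated along the channel dimension."""
    def __init__(self, in_ch=3, dims=(32, 32, 64), kernels=(3, 5, 7), stride=4):
        super().__init__()
        # Padding keeps every scale's output at the same spatial size,
        # so the per-scale features align token by token.
        self.projs = nn.ModuleList(
            nn.Conv2d(in_ch, d, k, stride=stride, padding=(k - stride + 1) // 2)
            for d, k in zip(dims, kernels)
        )
        self.norm = nn.LayerNorm(sum(dims))

    def forward(self, x):                           # x: (B, C, H, W)
        feats = [proj(x) for proj in self.projs]    # each: (B, d_i, H/s, W/s)
        tokens = torch.cat(feats, dim=1)            # (B, sum(d_i), H/s, W/s)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, sum(d_i))
        return self.norm(tokens)

# Example: a 512x512 RGB image becomes 128x128 tokens of width 128.
# emb = CrossScaleEmbedding()
# tokens = emb(torch.randn(2, 3, 512, 512))        # -> (2, 16384, 128)
```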
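
The spatially separable attention can likewise be sketched from its two named parts, assuming a Twins-SVT-style decomposition: locally-grouped attention restricted to small windows, followed by globally-sampled attention whose keys and values are spatially sub-sampled. The window size, sub-sampling ratio, and class names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LocallyGroupedAttention(nn.Module):
    """Self-attention restricted to non-overlapping ws x ws windows
    (fine-grained, short-range information)."""
    def __init__(self, dim, heads=4, ws=8):         # dim must be divisible by heads
        super().__init__()
        self.ws = ws
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, H, W):                     # x: (B, N, C), N = H*W
        B, N, C = x.shape
        ws = self.ws                                # assumes H % ws == 0 and W % ws == 0
        x = x.reshape(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)  # windows as batch
        x, _ = self.attn(x, x, x)
        x = x.reshape(B, H // ws, W // ws, ws, ws, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)

class GlobalSampledAttention(nn.Module):
    """Every token attends to a spatially sub-sampled set of keys/values
    (long-range, global context)."""
    def __init__(self, dim, heads=4, sr=4):
        super().__init__()
        self.sub = nn.Conv2d(dim, dim, sr, stride=sr)  # sub-sample K/V by factor sr
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, H, W):                     # x: (B, N, C)
        B, N, C = x.shape
        kv = self.sub(x.transpose(1, 2).reshape(B, C, H, W))
        kv = kv.flatten(2).transpose(1, 2)          # (B, N / sr^2, C)
        out, _ = self.attn(x, kv, kv)
        return out

# Example: a 64x64 feature map with 64 channels.
# x = torch.randn(2, 64 * 64, 64)
# y = GlobalSampledAttention(64)(LocallyGroupedAttention(64)(x, 64, 64), 64, 64)
```

Under these assumptions, windowing makes the local stage roughly linear in the number of tokens, and sub-sampling shrinks the global stage's key/value set by a factor of sr squared, which is consistent with the abstract's claim of reduced computation without sacrificing accuracy.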

Key words: road extraction, remote sensing image, Transformer, attention mechanism