Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (23): 219-228. DOI: 10.3778/j.issn.1002-8331.2308-0089

• Graphics and Image Processing •

Spatially Separable Attention Transformer with Cross-Scale Encoding for Remote Sensing Image Road Extraction

TIAN Qing, ZHANG Yao, ZHANG Zheng, LYU Qixiu   

  1. School of Information, North China University of Technology, Beijing 100144, China
    2.Suzhou Jiuhe Intelligent Technology Co., Ltd., Suzhou, Jiangsu 251100, China
  • Online: 2024-12-01    Published: 2024-11-29

Abstract: Road segmentation in remote sensing images is a research hotspot in the field of remote sensing applications and has attracted wide attention. Because remote sensing images inherently feature complex backgrounds and densely packed objects, constructing global semantic information is crucial for accurately extracting roads from them. This paper therefore builds on the Transformer architecture and proposes Cross-RoadFormer, a remote sensing road extraction model with cross-scale token embedding and spatially separable attention. Specifically, to address the non-uniform scale of roads in images, a cross-scale encoding layer is designed that encodes features of different scales into a single token embedding, which serves as the input to the Transformer and resolves the problem of cross-scale interaction within the Transformer. In addition, a spatially separable attention mechanism is proposed, in which locally-grouped attention captures fine-grained, short-range information and globally-sampled attention captures long-range, global contextual information, reducing the model's computational cost while preserving road extraction accuracy. Experiments on the Massachusetts and DeepGlobe datasets show that the proposed Cross-RoadFormer achieves higher intersection over union (IoU) scores of 68.40% and 58.04%, respectively, demonstrating the superiority of the proposed method.
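
The cross-scale encoding layer is only described at a high level in the abstract; the sketch below illustrates one plausible reading of it, in the style of a CrossFormer-like cross-scale embedding: several convolutions with different kernel sizes but a shared stride sample each patch at multiple scales, and their outputs are concatenated channel-wise into a single token. All module names, kernel sizes, and channel splits here are assumptions for illustration, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class CrossScaleEmbedding(nn.Module):
    """Hypothetical cross-scale token embedding: one conv per scale,
    outputs concatenated along the channel dimension."""
    def __init__(self, in_ch=3, dims=(32, 32, 64), kernels=(3, 5, 7), stride=4):
        super().__init__()
        # Padding keeps every scale's output at the same spatial size,
        # so the per-scale features align token by token.
        self.projs = nn.ModuleList(
            nn.Conv2d(in_ch, d, k, stride=stride, padding=(k - stride + 1) // 2)
            for d, k in zip(dims, kernels)
        )
        self.norm = nn.LayerNorm(sum(dims))

    def forward(self, x):                           # x: (B, C, H, W)
        feats = [proj(x) for proj in self.projs]    # each: (B, d_i, H/s, W/s)
        tokens = torch.cat(feats, dim=1)            # (B, sum(d_i), H/s, W/s)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, sum(d_i))
        return self.norm(tokens)

# Example: a 512x512 RGB image becomes 128x128 tokens of width 128.
# emb = CrossScaleEmbedding()
# tokens = emb(torch.randn(2, 3, 512, 512))        # -> (2, 16384, 128)
```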
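
The spatially separable attention can likewise be sketched from its two named parts, assuming a Twins-SVT-style decomposition: locally-grouped attention restricted to small windows, followed by globally-sampled attention whose keys and values are spatially sub-sampled. The window size, sub-sampling ratio, and class names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LocallyGroupedAttention(nn.Module):
    """Self-attention restricted to non-overlapping ws x ws windows
    (fine-grained, short-range information)."""
    def __init__(self, dim, heads=4, ws=8):         # dim must be divisible by heads
        super().__init__()
        self.ws = ws
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, H, W):                     # x: (B, N, C), N = H*W
        B, N, C = x.shape
        ws = self.ws                                # assumes H % ws == 0 and W % ws == 0
        x = x.reshape(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)  # windows as batch
        x, _ = self.attn(x, x, x)
        x = x.reshape(B, H // ws, W // ws, ws, ws, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)

class GlobalSampledAttention(nn.Module):
    """Every token attends to a spatially sub-sampled set of keys/values
    (long-range, global context)."""
    def __init__(self, dim, heads=4, sr=4):
        super().__init__()
        self.sub = nn.Conv2d(dim, dim, sr, stride=sr)  # sub-sample K/V by factor sr
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, H, W):                     # x: (B, N, C)
        B, N, C = x.shape
        kv = self.sub(x.transpose(1, 2).reshape(B, C, H, W))
        kv = kv.flatten(2).transpose(1, 2)          # (B, N / sr^2, C)
        out, _ = self.attn(x, kv, kv)
        return out

# Example: a 64x64 feature map with 64 channels.
# x = torch.randn(2, 64 * 64, 64)
# y = GlobalSampledAttention(64)(LocallyGroupedAttention(64)(x, 64, 64), 64, 64)
```

Under these assumptions, windowing makes the local stage roughly linear in the number of tokens, and sub-sampling shrinks the global stage's key/value set by a factor of sr squared, which is consistent with the abstract's claim of reduced computation without sacrificing accuracy.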

Key words: road extraction, remote sensing image, Transformer, attention mechanism