Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (15): 234-242. DOI: 10.3778/j.issn.1002-8331.2311-0400

• Graphics and Image Processing •

Remote Sensing Image Semantic Segmentation Network Based on Multimodal Fusion

HU Yuxiang, YU Changhong, GAO Ming   

  1. School of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou 310018, China
  • Online: 2024-08-01  Published: 2024-07-30

Abstract: Multimodal semantic segmentation networks can exploit complementary data from different modalities to improve segmentation accuracy. However, existing multimodal semantic segmentation models often combine the two types of modal data by simple concatenation, neglecting the distinct characteristics of the high- and low-frequency information in each modality, which leads to insufficient cross-modal feature extraction and suboptimal fusion. To address these issues, this paper proposes LHFNet (low feature and high feature fusion network), a remote sensing image semantic segmentation network that fuses IRRG images with DSM images. Firstly, for the correlated structural features in the low-frequency components of each modality, a low-level feature extraction enhancement module is designed to strengthen the extraction of features from the different modalities. Secondly, for the mutually independent detail features in the high-frequency components of each modality, a high-level feature fusion module is designed to guide the fusion of features across modalities. Finally, to bridge the semantic gap between the high- and low-frequency image features, a global atrous spatial pyramid pooling module is designed to skip-connect the high- and low-frequency information, enhancing the interaction between them. Experiments on the Vaihingen and Potsdam datasets provided by ISPRS show that LHFNet achieves global accuracies of 88.17% and 90.53%, respectively, higher than those of single-modality segmentation networks such as SegNet and DeepLabv3+ and of multimodal RGB-D segmentation networks such as RedNet and TSNet.
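To make the data flow described in the abstract concrete, the following is a minimal sketch of the two-stream topology: per-modality low-frequency (structure) features, per-modality high-frequency (detail) features, an atrous spatial pyramid pooling block with a global-context branch applied to the fused high-level features, and a skip connection back to the fused low-level features. It assumes a PyTorch implementation; all module names, channel widths, dilation rates, and the element-wise-sum fusion are illustrative placeholders, not the authors' published LHFNet code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling with a global-context branch."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True))
            for r in rates])
        # Image-level context branch (global average pooling).
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.ReLU(inplace=True))
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        g = F.interpolate(self.global_branch(x), size=(h, w),
                          mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [g], dim=1))

class TwoStreamFusionNet(nn.Module):
    """Two encoders (IRRG: 3 channels, DSM: 1 channel); fused low-level
    features feed a skip connection, fused high-level features feed ASPP."""
    def __init__(self, num_classes=6):
        super().__init__()
        self.irrg_low = self._block(3, 64)
        self.dsm_low = self._block(1, 64)
        self.irrg_high = self._block(64, 256, stride=2)
        self.dsm_high = self._block(64, 256, stride=2)
        self.aspp = ASPP(256, 256)
        self.decoder = nn.Sequential(
            nn.Conv2d(256 + 64, 128, 3, padding=1, bias=False),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, num_classes, 1))

    @staticmethod
    def _block(in_ch, out_ch, stride=1):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))

    def forward(self, irrg, dsm):
        # Per-modality low-frequency (structure) features, fused by sum;
        # the sum stands in for the paper's extraction-enhancement module.
        li, ld = self.irrg_low(irrg), self.dsm_low(dsm)
        low = li + ld
        # Per-modality high-frequency (detail) features, fused by sum;
        # the sum stands in for the paper's high-level fusion module.
        high = self.irrg_high(li) + self.dsm_high(ld)
        # Multi-scale context, then a skip connection bridging the
        # high- and low-frequency features before decoding.
        ctx = F.interpolate(self.aspp(high), size=low.shape[-2:],
                            mode="bilinear", align_corners=False)
        return self.decoder(torch.cat([ctx, low], dim=1))

# Example: six ISPRS classes, 256x256 tiles, IRRG (3ch) plus DSM (1ch).
net = TwoStreamFusionNet(num_classes=6)
logits = net(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256))
print(logits.shape)  # torch.Size([1, 6, 256, 256])

In the paper the two fusion points are learned modules rather than element-wise sums; the sketch only fixes the overall topology (dual encoders, frequency-wise fusion, pyramid pooling with a global branch, and a high-to-low skip connection).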

Key words: multimodal semantic segmentation, feature fusion, feature extraction, multi-source remote sensing images
