Dual-Modal Feature Fusion Semantic Segmentation of RGB-D

doi:10.3778/j.issn.1002-8331.2111-0518

Abstract

Abstract: In complex indoor scenes, existing RGB semantic segmentation networks are sensitive to factors such as color and lighting, and RGB-D semantic segmentation networks struggle to fuse the features of the two modalities effectively. To address these problems, this paper proposes AMBFNet (attention mechanism bimodal fusion network), an attention-based RGB-D dual-modal feature fusion network with an encoder-decoder structure. First, a bimodal feature fusion structure (AMBF) is built to allocate the position and channel information of the features at each stage of the encoder branches. Next, a dual-attention-aware context (DA-context) module is designed to merge context information. Finally, the decoder fuses multi-scale feature maps across layers to reduce inter-class misrecognition and the loss of small-scale targets in the predictions. Tests on two public datasets, SUN RGB-D and NYU Depth v2 (NYUDV2), show that, under the same hardware conditions, the proposed network segments better than recent RGB-D semantic segmentation networks such as RedNet, ACNet, and ESANet, reaching a mean intersection over union (MIoU) of 47.9% and 50.0%, respectively.

Key words: attention mechanism, dual-modal feature fusion, dual-attention-aware context, RGB-D semantic segmentation
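
The MIoU values quoted in the abstract are mean intersection-over-union scores. As a rough illustration of how this metric is commonly computed, and not the authors' evaluation code, the sketch below builds a confusion matrix from flattened label maps and averages the per-class IoU. The function names, the `ignore_index` parameter, and the toy labels are assumptions made for this example only.

```python
def confusion_matrix(pred, target, num_classes, ignore_index=None):
    """Count pixels per (ground truth, prediction) pair.

    Rows index the ground-truth class, columns the predicted class.
    Pixels whose ground-truth label equals ignore_index are skipped.
    """
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        cm[t][p] += 1
    return cm


def mean_iou(cm):
    """Average IoU = TP / (TP + FP + FN) over classes that occur at all."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # class c pixels predicted as something else
        fp = sum(cm[r][c] for r in range(n)) - tp   # other pixels predicted as class c
        denom = tp + fp + fn
        if denom > 0:                               # skip classes absent from both maps
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: flattened label maps with 3 classes (values are made up).
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(pred, target, num_classes=3)
miou = mean_iou(cm)
print(f"MIoU = {miou:.3f}")
```

In practice the confusion matrix is accumulated over the whole test set before the per-class IoU is averaged, so that rare classes are not dominated by per-image noise.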

摘要： 针对复杂室内场景中，现有RGB图像语义分割网络易受颜色、光照等因素影响以及RGB-D图像语义分割网络难以有效融合双模态特征等问题，提出一种基于注意力机制的RGB-D双模态特征融合语义分割网络AMBFNet（attention mechanism bimodal fusion network）。该网络采用编-解码器结构，首先搭建双模态特征融合结构（AMBF）来合理分配编码支路各阶段特征的位置与通道信息，然后设计双注意感知的上下文（DA-context）模块以合并上下文信息，最后通过解码器将多尺度特征图进行跨层融合，以减少预测结果中类间误识别和小尺度目标丢失问题。在SUN RGB-D和NYU Depth v2（NYUDV2）两个公开数据集上的测试结果表明，相较于残差编解码（RedNet）、注意力互补网络（ACNet）、高效场景分析网络（ESANet）等目前较先进的RGB-D语义分割网络，在同等硬件条件下，该网络具有更好的分割性能，平均交并比（MIoU）分别达到了47.9%和50.0%。

关键词: 注意力机制, 双模态特征融合, 双重注意感知上下文, RGB-D语义分割

LUO Penlin, FANG Yanhong, LI Xin, LI Xue. Dual-Modal Feature Fusion Semantic Segmentation of RGB-D[J]. Computer Engineering and Applications, 2023, 59(7): 222-231.

罗盆琳, 方艳红, 李鑫, 李雪. RGB-D双模态特征融合语义分割[J]. 计算机工程与应用, 2023, 59(7): 222-231.
