Cross-Modal Transparent Object Segmentation Combining CNN-Transformer

doi:10.3778/j.issn.1002-8331.2310-0064

Abstract

Abstract: Transparent objects are highly transparent, glossy, and distinctively textured, so the boundary between object and background is often blurred and traditional image segmentation algorithms struggle to recognize and segment them accurately. This paper therefore proposes CTNet, a cross-modal semantic segmentation algorithm for transparent objects that combines a CNN with a Transformer. CTNet uses an encoder-decoder structure built on a hybrid CNN-Transformer network to predict the category and location of transparent objects across modalities: the CNN extracts image features, and a multimodal fusion transformer (MFT) fuses the modalities. An enhanced boundary attention module (EBAM) is designed to improve segmentation along image edges, and a multi-scale fusion decoding structure is proposed to reduce blurred features. On the RGB-T-Glass dataset, CTNet achieves a mean absolute error (MAE) of 3.3% and an intersection over union (IOU) of 90.18% and 95.00% on the test sets with and without transparent objects, respectively. On the GDD dataset, the MAE is 6.9% and the IOU is 87.6%. The experimental results show that CTNet segments transparent objects accurately using visible and thermal infrared images and meets the accuracy and robustness requirements of the target task.

Key words: CNN-Transformer, multimodal, transparent objects, semantic segmentation, feature fusion
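The two metrics reported in the abstract, MAE and IOU, can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function names, the 0.5 binarization threshold, and the convention that two empty masks score an IOU of 1.0 are all assumptions made for illustration.

```python
from typing import Sequence

# A mask is a 2D grid of values in [0, 1]: predicted probabilities
# or a binary ground truth (1 = transparent object, 0 = background).
Mask = Sequence[Sequence[float]]


def mean_absolute_error(pred: Mask, target: Mask) -> float:
    """Average per-pixel absolute difference between prediction and ground truth."""
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += abs(p - t)
            count += 1
    return total / count


def intersection_over_union(pred: Mask, target: Mask, threshold: float = 0.5) -> float:
    """IOU of the foreground regions after binarizing the prediction.

    The threshold (0.5) is an assumed value, not one taken from the paper.
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            pred_fg = p >= threshold
            target_fg = t >= 0.5
            intersection += int(pred_fg and target_fg)
            union += int(pred_fg or target_fg)
    # Assumed convention: if neither mask has foreground, treat it as a perfect match.
    return intersection / union if union else 1.0


if __name__ == "__main__":
    pred = [[0.9, 0.2], [0.6, 0.1]]
    target = [[1, 0], [0, 0]]
    print("MAE:", mean_absolute_error(pred, target))
    print("IOU:", intersection_over_union(pred, target))
```

Lower MAE and higher IOU both indicate a closer match between the predicted mask and the ground truth, which is how the percentages in the abstract should be read.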

PAN Weilan, ZHANG Rongfen, LIU Yuhong, ZHANG Jiyou, SUN Long. Cross-Modal Transparent Object Segmentation Combining CNN-Transformer[J]. Computer Engineering and Applications, 2025, 61(4): 222-229.
