Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (4): 222-229. DOI: 10.3778/j.issn.1002-8331.2310-0064

• Graphics and Image Processing •

Cross-Modal Transparent Object Segmentation Combining CNN-Transformer

PAN Weilan, ZHANG Rongfen, LIU Yuhong, ZHANG Jiyou, SUN Long   

  1. College of Big Data & Information Engineering, Guizhou University, Guiyang 550025, China
  • Online: 2025-02-15  Published: 2025-02-14

Abstract: Transparent objects exhibit high transparency, glossiness, and special surface textures, so the boundary between an object and its background is often blurred, and traditional image segmentation algorithms struggle to recognize and segment them accurately. This paper therefore proposes CTNet, a cross-modal semantic segmentation algorithm for transparent objects that combines a CNN with a Transformer. The algorithm adopts an encoder-decoder structure built on a hybrid CNN-Transformer network to predict the category and location of transparent objects across modalities: the CNN extracts image features, while a multimodal fusion Transformer (MFT) fuses the two modalities. An enhanced boundary attention module (EBAM) is designed to improve edge segmentation, and a multi-scale fusion decoding structure is proposed to suppress blurred features. On the RGB-T-Glass dataset, CTNet achieves a mean absolute error (MAE) of 3.3%, with an intersection over union (IOU) of 90.18% and 95.00% on the test sets with and without transparent objects, respectively; on the GDD dataset, the MAE is 6.9% and the IOU is 87.6%. The experimental results show that CTNet accurately segments transparent objects from paired visible and thermal infrared images, meeting the accuracy and robustness requirements of transparent object segmentation in the target task.
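
The pipeline described above (per-modality CNN encoders, a Transformer that fuses visible and thermal tokens, and an upsampling decoder) can be illustrated with a minimal PyTorch sketch. Everything below is an illustrative assumption, not the authors' implementation: the module names TokenFusion and TinyCTNet, the channel counts, strides, and the single-stage decoder are all hypothetical, and the paper's actual MFT and EBAM designs are not reproduced here.

```python
# Minimal sketch of the cross-modal CNN-Transformer idea from the abstract.
# All module names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class TokenFusion(nn.Module):
    """Hypothetical multimodal fusion step: flatten RGB and thermal feature
    maps into token sequences and mix them with a Transformer encoder."""
    def __init__(self, channels: int, num_heads: int = 4, depth: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, rgb_feat: torch.Tensor, th_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = rgb_feat.shape
        # (B, C, H, W) -> (B, H*W, C) token sequence for each modality
        rgb_tokens = rgb_feat.flatten(2).transpose(1, 2)
        th_tokens = th_feat.flatten(2).transpose(1, 2)
        # Concatenate along the sequence axis so attention spans both modalities
        fused = self.encoder(torch.cat([rgb_tokens, th_tokens], dim=1))
        # Fold the RGB half of the sequence back into a feature map
        return fused[:, : h * w].transpose(1, 2).reshape(b, c, h, w)

class TinyCTNet(nn.Module):
    """Toy encoder-fusion-decoder pipeline: a CNN backbone per modality,
    Transformer fusion, and a lightweight upsampling decoder."""
    def __init__(self, channels: int = 64):
        super().__init__()
        def backbone(in_ch: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_ch, channels, 3, stride=4, padding=1),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.rgb_encoder = backbone(3)       # visible-light branch
        self.thermal_encoder = backbone(1)   # thermal-infrared branch
        self.fusion = TokenFusion(channels)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(channels, 1, 1))       # binary transparent-object mask

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        fused = self.fusion(self.rgb_encoder(rgb), self.thermal_encoder(thermal))
        return self.decoder(fused)

if __name__ == "__main__":
    model = TinyCTNet()
    logits = model(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
    print(logits.shape)  # torch.Size([2, 1, 64, 64])
```

Concatenating the two token sequences before self-attention is one common way to let every RGB token attend to every thermal token; whether CTNet's MFT uses joint attention, cross-attention, or another scheme is not specified in the abstract.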

Key words: CNN-Transformer, multimodal, transparent objects, semantic segmentation, feature fusion
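
For reference, the two metrics quoted in the abstract are commonly computed as below. This is a generic sketch under the usual saliency/segmentation definitions; the paper's exact evaluation protocol (probability versus binary predictions, threshold choice, per-image averaging) is an assumption here.

```python
# Generic MAE and IoU definitions for binary segmentation maps;
# the paper's exact evaluation protocol is assumed, not known.
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a predicted probability map in [0, 1]
    and a binary ground-truth mask, averaged over all pixels."""
    return float(np.mean(np.abs(pred - gt)))

def iou(pred: np.ndarray, gt: np.ndarray, thresh: float = 0.5) -> float:
    """Intersection over union of the thresholded prediction and the mask."""
    p = pred >= thresh
    g = gt >= 0.5
    union = np.logical_or(p, g).sum()
    # Convention: empty prediction and empty mask count as a perfect match
    return float(np.logical_and(p, g).sum() / union) if union else 1.0

pred = np.array([[0.9, 0.2], [0.7, 0.1]])
gt = np.array([[1.0, 0.0], [1.0, 0.0]])
print(mae(pred, gt))  # 0.175
print(iou(pred, gt))  # 1.0
```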