Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (14): 195-205. DOI: 10.3778/j.issn.1002-8331.2412-0150

• Pattern Recognition and Artificial Intelligence •

Multi-Modal Edge Detection Network Based on Collaboration of CNN and Transformer

LI Yonghui, ZHAO Yao, JIA Xiaohong, WEI Chenzhen, CHANG Wenwen   

  1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
  2. Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China
  • Online: 2025-07-15    Published: 2025-07-15

Abstract: Edge detection plays a crucial role in computer vision tasks. However, existing edge detection algorithms rely primarily on CNNs as encoders, which leads to shortcomings in fine detail, accuracy, and noise handling. To address these issues, a multi-modal edge detection network based on the collaboration of CNN and Transformer is proposed. First, the network employs a high-resolution feature fusion module built on a parameter-free attention residual structure, which preserves the low-level properties of the image and enhances the global feature representation. Next, a lightweight CNN layer with a multi-scale shuffle attention module is designed to perform gradient encoding and capture the high-frequency properties of the image, while a Transformer architecture encodes features and builds high-level global dependencies. The feature representation is then reconstructed by fusing the high-frequency properties with the global dependencies. Finally, the multi-scale features from the CNN, the Transformer, and the high-resolution feature fusion module are progressively aggregated and decoded, enabling high-precision localization of image boundaries. Compared with mainstream algorithms, the proposed model achieves superior metrics on both the BSDS500 and NYUD-v2 datasets.

Key words: edge detection, convolutional neural network (CNN), Transformer, multi-modal, deep learning
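
The following is a minimal, hypothetical PyTorch sketch of the dual-branch idea the abstract describes: a lightweight CNN path that captures local, high-frequency (gradient-like) cues and a Transformer path that builds global dependencies, with the two fused and decoded into a full-resolution edge map. All module names, dimensions, and fusion/decoding details are illustrative assumptions, not the authors' implementation, which additionally includes the high-resolution feature fusion module with a parameter-free attention residual structure and the multi-scale shuffle attention module.

# Hypothetical sketch (not the authors' code): CNN branch for local/high-frequency
# features, Transformer branch for global dependencies, fused and decoded into a
# single-channel edge map. Sizes and module choices are illustrative assumptions.
import torch
import torch.nn as nn

class CNNBranch(nn.Module):
    """Lightweight convolutional encoder standing in for the gradient/high-frequency path."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.BatchNorm2d(dim), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.BatchNorm2d(dim), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)  # (B, dim, H/4, W/4)

class TransformerBranch(nn.Module):
    """Patch-token Transformer encoder standing in for the global-dependency path."""
    def __init__(self, in_ch=3, dim=64, depth=2, heads=4, patch=4):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)  # patch embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        t = self.embed(x)                                # (B, dim, H/4, W/4)
        b, c, h, w = t.shape
        t = self.encoder(t.flatten(2).transpose(1, 2))   # tokens: (B, H*W/16, dim)
        return t.transpose(1, 2).reshape(b, c, h, w)

class EdgeNetSketch(nn.Module):
    """Fuse both branches and decode to a full-resolution edge probability map."""
    def __init__(self, dim=64):
        super().__init__()
        self.cnn = CNNBranch(dim=dim)
        self.trans = TransformerBranch(dim=dim)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)           # channel-wise fusion of the two branches
        self.decode = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(dim, 1, 3, padding=1),
        )

    def forward(self, x):
        f = torch.cat([self.cnn(x), self.trans(x)], dim=1)
        return torch.sigmoid(self.decode(self.fuse(f)))  # edge map in [0, 1]

if __name__ == "__main__":
    edges = EdgeNetSketch()(torch.randn(1, 3, 320, 320))
    print(edges.shape)  # torch.Size([1, 1, 320, 320])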
