Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (1): 252-262. DOI: 10.3778/j.issn.1002-8331.2308-0315

• Graphics and Image Processing •

DCaT: Lightweight Semantic Segmentation Model for High-Resolution Scenes

HUANG Kedi, HUANG Heming, LI Wei, FAN Yonghong

  1. School of Computer Science and Technology, Qinghai Normal University, Xining 810008, China
  2. State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, China
  • Online: 2025-01-01 Published: 2024-12-31

Abstract: Semantic segmentation is a key computer vision task for analyzing and understanding scenes. However, existing segmentation models have high computational cost and memory requirements, which makes them ill-suited to lightweight semantic segmentation of high-resolution scenes. To address this problem, a new lightweight semantic segmentation model for high-resolution scenes, DCaT, is proposed. First, depthwise separable convolution extracts the local low-level semantics of the image. Second, a lightweight Transformer based on coordinate-aware and dynamic sparse mixed attention captures the global high-level semantics. Third, a fusion module injects the high-level semantics into the low-level semantics. Finally, a segmentation head outputs the per-pixel predicted labels. Experimental results on the high-resolution Cityscapes dataset show that, compared with the baseline model, DCaT improves the mean intersection over union (mIoU) by 1.5 percentage points, reduces model complexity by 26%, and increases inference speed by 12%. DCaT thus achieves a better balance between model complexity and performance in high-resolution scenes, which demonstrates its effectiveness and practicality.
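The abstract names depthwise separable convolution as the local feature extractor. As a rough illustration of why this operation suits lightweight models, the sketch below compares weight counts for a standard convolution and a depthwise separable one. The function names and the layer sizes are hypothetical, chosen only for illustration; they are not taken from DCaT.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, for illustration only (not DCaT's configuration).
c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(std, sep, round(sep / std, 3))  # 73728 8768 0.119
```

For these assumed sizes the separable form needs roughly 12% of the weights, since the ratio is 1/c_out + 1/k².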

Key words: semantic segmentation, lightweight, high resolution, Transformer, sparse attention
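The headline result is reported as mean intersection over union (mIoU). The following is a minimal sketch of how this metric is commonly computed from a confusion matrix; it is not the authors' evaluation code. The function name `mean_iou` and the toy labels are assumptions made for illustration, and benchmark evaluations typically also accumulate counts over the whole dataset and exclude ignored pixels, which this sketch does not do.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean IoU over classes that appear in the labels or the predictions."""
    # conf[t][p] counts pixels whose true class is t and predicted class is p.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        conf[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                    # true class c, predicted otherwise
        fp = sum(row[c] for row in conf) - tp     # predicted c, true class otherwise
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy flattened label maps (hypothetical data, not from Cityscapes).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
miou = mean_iou(y_true, y_pred, num_classes=3)
print(miou)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is about 0.5. A gain of 1.5 percentage points, as reported in the abstract, corresponds to an increase of 0.015 on this 0-to-1 scale.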