Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (18): 263-272. DOI: 10.3778/j.issn.1002-8331.2407-0131

• Graphics and Image Processing •

Improved Transformer Industrial Image Classification Algorithm Fusing Local and Global Feature

WANG Ling, CUI Zhiyu, HUANG Jing, WANG Peng, BAI Yan'e

  1. School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022, China
  • Online: 2025-09-15 Published: 2025-09-15

Abstract: In industrial scenarios where data acquisition is limited, environments are complex, and lighting varies widely, the classification accuracy of ViT models still leaves room for improvement. To address this problem, an industrial image classification algorithm based on the CMT model is proposed. First, the Patch Embedding module is improved by adding an affine transformation and consecutive convolutional blocks, which strengthens the model's generalization on small datasets. Second, the CMT Block is improved in three ways: a parallel local feature extraction module is introduced to strengthen local feature extraction; multi-head self-attention is replaced with token interaction attention to improve global feature representation; and depthwise convolution and channel attention are integrated into the feed-forward network so that the model can effectively capture neighboring features. Finally, a feature fusion module is proposed that fuses local and global features, enriching the feature representation and enhancing classification performance on small datasets. Experiments on a self-built filling-bucket dataset and on the public Car Parts and Tiny ImageNet datasets show that, compared with the CMT model, the improved CMT model raises Top-1 Accuracy by 4.7, 6.9, and 5.2 percentage points and Macro F1 by 0.057, 0.071, and 0.048, respectively.

Key words: ViT model, industrial image classification, CMT model, attention, feature fusion
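
The abstract reports gains in two different units: Top-1 Accuracy in percentage points and Macro F1 as a raw score difference on a 0 to 1 scale. The following is a minimal sketch of how the two metrics are commonly computed, assuming the standard definitions (the abstract does not give the formulas the authors used). The function names and the toy labels are hypothetical and are not taken from the paper or its datasets.

```python
from collections import Counter


def top1_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    classes = set(y_true) | set(y_pred)
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels for three classes (illustration only).
y_true = [0, 0, 0, 1, 1, 2]
baseline_pred = [0, 0, 1, 1, 2, 2]
improved_pred = [0, 0, 0, 1, 2, 2]

# Accuracy gain expressed in percentage points, Macro F1 gain as a raw difference.
acc_gain_pp = 100 * (top1_accuracy(y_true, improved_pred)
                     - top1_accuracy(y_true, baseline_pred))
f1_gain = macro_f1(y_true, improved_pred) - macro_f1(y_true, baseline_pred)
print(f"Top-1 gain: {acc_gain_pp:.1f} percentage points, Macro F1 gain: {f1_gain:.3f}")
```

Because Macro F1 averages per-class scores without weighting by class size, it is more sensitive than Top-1 Accuracy to errors on rare classes, which is one reason the two metrics are often reported together for small or imbalanced datasets.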