Computer Engineering and Applications (计算机工程与应用), 2026, Vol. 62, Issue (8): 176-188. DOI: 10.3778/j.issn.1002-8331.2507-0389

• Pattern Recognition and Artificial Intelligence •

视觉语言大模型驱动的多模态工业缺陷检测与智能决策

董禹彤1,侯惠芳1+,龚明明2,陈自康1   

  1. 河南工业大学 人工智能与大数据学院,郑州 450000
  2. 科大讯飞 讯飞聆智人才培养业务部,郑州 450000
    + 通信作者 E-mail:houhuifang@haut.edu.cn
  • 收稿日期:2025-07-30 修回日期:2025-11-19 在线发布日期:2026-04-15 出版日期:2026-04-15
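  • Supported by:
    Henan Province Science and Technology Research Project (252102221018); Henan University of Technology University-Level Undergraduate Innovation Training Program (202510463088).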

Multimodal Industrial Defect Detection and Intelligent Decision-Making Driven by Large Vision-Language Models

DONG Yutong1, HOU Huifang1+, GONG Mingming2, CHEN Zikang1   

  1. School of Artificial Intelligence and Big Data, Henan University of Technology, Zhengzhou 450000, China
  2. Xunfei Lingzhi Talent Training Department, iFlytek, Zhengzhou 450000, China
    + Corresponding author E-mail:houhuifang@haut.edu.cn
  • Received:2025-07-30 Revised:2025-11-19 Online:2026-04-15 Published:2026-04-15

摘要: 缺陷检测是工业领域的重要应用场景。针对传统检测效率低、工业数据需智能分析处理、单模态系统泛化能力不足及大模型存在幻觉等问题,提出并实现一种增强型工业缺陷智检系统,支持多源异构数据协同处理与高效决策。系统微调工业异常检测多模态大模型,通过改进的跨模态特征对齐算法与提示学习,实现图像、文本等多源数据语义融合,同步输出缺陷识别的语义描述;构建工业数据知识库,借助RAG检索增强生成,抑制模型幻觉,提升检测可信度与决策效果;结合Depth-Anything-V2生成高一致性深度图,支持缺陷三维量化分析,突破二维检测局限;基于自然语言驱动的Excel智能分析模块,自动提取质检表格数据并可视化;OCR智检模块融合PaddleOCR与ErnieBot,实现工业文档文本提取与语义理解。通过智能体统筹核心功能模块给出综合决策建议。在金属、螺丝等5类典型工业零件测试中,系统缺陷类型平均识别精确率为95.26%,位置定位平均误差2.9像素,其拓展了工业质检自动化水平与分析维度,为制造业智能化转型提供了实用技术方案。

关键词: 工业缺陷检测, 多模态融合, 大视觉语言模型(LVLM), 深度图分析, 检索增强生成(RAG), 数据决策, 智能体

Abstract: Defect detection is an important application scenario in industry. To address the low efficiency of traditional inspection, the need for intelligent analysis and processing of industrial data, the limited generalization of single-modal systems, and hallucination in large models, an enhanced intelligent industrial defect inspection system is proposed and implemented. It supports collaborative processing of multi-source heterogeneous data and efficient decision-making. The system fine-tunes a multimodal large model for industrial anomaly detection. Through an improved cross-modal feature alignment algorithm and prompt learning, it fuses the semantics of multi-source data such as images and text, and outputs a semantic description of each recognized defect at the same time. An industrial data knowledge base is built and used for retrieval-augmented generation (RAG), which suppresses model hallucination and improves detection credibility and decision quality. Depth-Anything-V2 is used to generate highly consistent depth maps that support three-dimensional quantitative analysis of defects, overcoming the limitations of two-dimensional detection. A natural-language-driven Excel analysis module automatically extracts and visualizes quality-inspection table data. The OCR inspection module integrates PaddleOCR and ErnieBot to extract text from industrial documents and interpret its meaning. Finally, an agent coordinates the core functional modules to provide comprehensive decision recommendations. In tests on five typical categories of industrial parts, such as metal parts and screws, the system achieves an average precision of 95.26% in defect-type recognition and an average localization error of 2.9 pixels. The system raises the level of automation and broadens the analysis dimensions of industrial quality inspection, providing a practical technical solution for the intelligent transformation of manufacturing.

Key words: industrial defect detection, multimodal fusion, large vision-language model (LVLM), depth map analysis, retrieval-augmented generation (RAG), data-driven decision-making, agent
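
The abstract reports two headline metrics: an average precision for defect-type recognition and an average localization error in pixels. The abstract does not define how either is computed, so the following is only a minimal sketch under stated assumptions: precision is taken per label as TP / (TP + FP) and averaged without weighting, and localization error is taken as the mean Euclidean distance between predicted and reference defect centers. All function names and the sample data are hypothetical and are not drawn from the paper.

```python
from math import hypot


def precision(predicted, actual, label):
    """Precision for one label: TP / (TP + FP); 0.0 if the label is never predicted."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == label and a == label)
    fp = sum(1 for p, a in zip(predicted, actual) if p == label and a != label)
    return tp / (tp + fp) if (tp + fp) else 0.0


def mean_precision(predicted, actual):
    """Unweighted mean of per-label precision (an assumed averaging scheme)."""
    labels = sorted(set(predicted) | set(actual))
    return sum(precision(predicted, actual, lb) for lb in labels) / len(labels)


def mean_localization_error(pred_points, true_points):
    """Mean Euclidean distance, in pixels, between paired (x, y) points."""
    dists = [
        hypot(px - tx, py - ty)
        for (px, py), (tx, ty) in zip(pred_points, true_points)
    ]
    return sum(dists) / len(dists)


if __name__ == "__main__":
    # Hypothetical labels for five inspected samples.
    actual = ["scratch", "scratch", "dent", "dent", "ok"]
    predicted = ["scratch", "dent", "dent", "dent", "ok"]
    print(f"mean precision: {mean_precision(predicted, actual):.4f}")

    # Hypothetical defect centers in pixel coordinates.
    pred_points = [(10, 10), (23, 24)]
    true_points = [(13, 14), (20, 20)]
    print(f"mean localization error: {mean_localization_error(pred_points, true_points):.1f} px")
```

If the reported figures were instead averaged per part category, or weighted by sample count, the averaging step would change, but the per-label precision and the distance computation would stay the same.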