Title:
CLIP-MDC: CLIP encoder based multimodal defect classification with synthetic anomaly generation for real-time surface defect detection.
Authors:
Ha, Taewon (AUTHOR), Hwang, Chaeseon (AUTHOR), Jeong, Jongpil (AUTHOR) jpjeong@skku.edu
Source:
Journal of Intelligent Manufacturing. Jan 2026, p1-23.
Database:
Business Source Elite

In this study, we introduce contrastive language–image pre-training-based multimodal defect classification (CLIP-MDC), a framework for multimodal defect detection and classification in smart manufacturing. Using text prompts that combine objects and defect types, the framework establishes a semantic space linking images and texts, enabling explainable defect predictions expressed in natural language. The model integrates a lightweight backbone network with contrastive language–image pre-training (CLIP) encoders to perform both pixel-level anomaly segmentation and image-level defect classification in supervised and weakly supervised settings. We additionally incorporate a Perlin noise-based synthetic anomaly generation technique to support learning when labeled data are scarce, and a dual prediction architecture enables accurate simultaneous inference of defect location and type. In experiments on the MVTec AD and KSDD2 datasets, the model achieved an area under the receiver operating characteristic curve (AUROC) of 99.9%, an area under the per-region overlap curve (AUPRO) of 98.6%, a pixel-level AUROC (P-AUROC) of 99.9%, and an average precision for localization (AP_loc) of 87.6%. It also demonstrated real-time capability, with an average inference speed of 6.6 ms on an A100 GPU. CLIP-MDC combines visual and linguistic information in a semantics-based multimodal learning framework to deliver accuracy, explainability, generalization, and real-time efficiency in defect detection, making it a practical and scalable solution for industrial defect analysis in real-world manufacturing environments. [ABSTRACT FROM AUTHOR]
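The two mechanisms named in the abstract, prompt-driven CLIP classification and Perlin noise-based anomaly synthesis, can be sketched in a few lines of Python. These sketches are not the authors' implementation: the model checkpoint, prompt wording, file path, blend factor, and threshold percentile are illustrative assumptions, and a simplified multi-octave value noise stands in for true Perlin noise.

# Sketch 1: zero-shot defect scoring with CLIP text prompts.
# The "openai/clip-vit-base-patch32" checkpoint, the prompt templates, and the
# image path are assumptions, not the prompts or weights used in the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

objects = ["metal surface"]
defects = ["no defect", "a scratch", "a crack", "a dent"]
prompts = [f"a photo of a {o} with {d}" for o in objects for d in defects]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("sample_surface.png").convert("RGB")  # placeholder file
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, num_prompts)
probs = logits.softmax(dim=-1)[0]
for prompt, p in zip(prompts, probs.tolist()):
    print(f"{p:.3f}  {prompt}")

The second sketch generates a blob-shaped mask from smooth noise and pastes a blended texture onto a defect-free image, which is the general idea behind Perlin noise-based synthetic anomaly generation for training with few labeled defects.

# Sketch 2: synthetic anomaly generation with a Perlin-style noise mask.
# A multi-octave value noise is used here as a simplified stand-in for Perlin
# noise; beta and thresh_pct are assumed values.
import numpy as np

def perlin_like_noise(h, w, res=8, octaves=3, rng=None):
    """Multi-octave value noise in [0, 1], smooth enough to yield blob masks."""
    rng = np.random.default_rng() if rng is None else rng
    noise = np.zeros((h, w), dtype=np.float32)
    amplitude, total = 1.0, 0.0
    for octave in range(octaves):
        cells = res * (2 ** octave)
        coarse = rng.random((cells + 1, cells + 1)).astype(np.float32)
        ys = np.linspace(0.0, cells, h, endpoint=False)
        xs = np.linspace(0.0, cells, w, endpoint=False)
        y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
        ty = (ys - y0)[:, None]
        tx = (xs - x0)[None, :]
        ty = ty * ty * (3.0 - 2.0 * ty)  # smoothstep interpolation weights
        tx = tx * tx * (3.0 - 2.0 * tx)
        c00 = coarse[np.ix_(y0, x0)]
        c10 = coarse[np.ix_(y0 + 1, x0)]
        c01 = coarse[np.ix_(y0, x0 + 1)]
        c11 = coarse[np.ix_(y0 + 1, x0 + 1)]
        layer = (c00 * (1 - ty) * (1 - tx) + c10 * ty * (1 - tx)
                 + c01 * (1 - ty) * tx + c11 * ty * tx)
        noise += amplitude * layer
        total += amplitude
        amplitude *= 0.5
    return noise / total

def synthesize_anomaly(normal_img, texture_img, beta=0.5, thresh_pct=85, rng=None):
    """Paste a textured blob onto a defect-free image; return image and mask.

    normal_img and texture_img are float arrays of shape (H, W, 3) in [0, 1].
    """
    h, w = normal_img.shape[:2]
    noise = perlin_like_noise(h, w, rng=rng)
    mask = (noise > np.percentile(noise, thresh_pct)).astype(np.float32)[..., None]
    blended = beta * normal_img + (1.0 - beta) * texture_img
    augmented = (1.0 - mask) * normal_img + mask * blended
    return augmented, mask[..., 0]

# Example use: aug, mask = synthesize_anomaly(normal, texture); the returned mask
# can serve as the pixel-level target for the segmentation branch.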

Copyright of Journal of Intelligent Manufacturing is the property of Springer Nature and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)