Visual anomaly classification and segmentation are vital for automating industrial quality inspection. Prior research in the field has focused on training custom models for each quality-inspection task, which requires task-specific images and annotations. In this paper we move away from this regime, addressing zero-shot and few-normal-shot anomaly classification and segmentation. Recently, CLIP, a vision-language model, has shown revolutionary generality, with competitive zero-/few-shot performance compared to full supervision. However, CLIP falls short on anomaly classification and segmentation tasks. Hence, we propose window-based CLIP (WinCLIP) with (1) a compositional ensemble of state words and prompt templates and (2) efficient extraction and aggregation of window-, patch-, and image-level features aligned with text. We also propose its few-normal-shot extension, WinCLIP+, which uses complementary information from normal images. On MVTec-AD (and VisA), without further tuning, WinCLIP achieves 91.8%/85.1% (78.1%/79.6%) AUROC in zero-shot anomaly classification and segmentation, and WinCLIP+ achieves 93.1%/95.2% (83.8%/96.4%) in the 1-normal-shot setting, surpassing the state of the art by large margins.
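The compositional prompt ensemble in (1) can be sketched as follows: state words describing normal versus anomalous conditions are combined with prompt templates to form text prompts, which WinCLIP then encodes with CLIP's text encoder and averages into one text feature per class. This is a minimal sketch; the specific state words and templates below are illustrative assumptions, not the paper's actual lists, and the CLIP encoding step is only indicated in comments.

```python
# Hedged sketch of a compositional prompt ensemble: state words are
# combined with prompt templates via a Cartesian product. The word and
# template lists here are assumed examples for illustration.
NORMAL_STATES = ["flawless", "perfect"]        # assumed state words
ANOMALY_STATES = ["damaged", "with a defect"]  # assumed state words
TEMPLATES = [
    "a photo of a {} {}.",
    "a cropped photo of the {} {}.",
]

def compose_prompts(states, templates, obj="object"):
    """Cartesian product of state words and templates for one class."""
    return [t.format(s, obj) for t in templates for s in states]

normal_prompts = compose_prompts(NORMAL_STATES, TEMPLATES)
anomaly_prompts = compose_prompts(ANOMALY_STATES, TEMPLATES)
# In WinCLIP, each prompt list would be embedded by CLIP's text encoder
# and the embeddings averaged to yield one text feature per class, which
# is then matched against image/window/patch features.
```

Averaging over many prompt variants makes the class text features less sensitive to any single wording choice, which is the motivation for ensembling stated in the abstract.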