
TIPS Over Tricks: Simple Prompts for Effective Zero-shot Anomaly Detection

Alireza Salehi
Ehsan Karami
Sepehr Noey
Sahand Noey
Makoto Yamada
Reshad Hosseini
Mohammad Sabokrou
Main: 3 Pages
8 Figures
Bibliography: 2 Pages
10 Tables
Appendix: 5 Pages
Abstract

Anomaly detection identifies departures from expected behavior in safety-critical settings. When normal data from the target domain are unavailable, zero-shot anomaly detection (ZSAD) leverages vision-language models (VLMs). However, CLIP's coarse image-text alignment limits both localization and detection because of (i) spatial misalignment and (ii) weak sensitivity to fine-grained anomalies; prior works compensate with complex auxiliary modules yet largely overlook the choice of backbone. We revisit the backbone and use TIPS, a VLM trained with spatially aware objectives. While TIPS alleviates CLIP's issues, it exposes a distributional gap between global and local features. We address this gap with decoupled prompts (fixed for image-level detection, learnable for pixel-level localization) and by injecting local evidence into the global score. Without CLIP-specific tricks, our TIPS-based pipeline improves image-level performance by 1.1-3.9% and pixel-level performance by 1.5-6.9% across seven industrial datasets, delivering strong generalization with a lean architecture. Code is available at this http URL.
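
The scoring idea in the abstract (separate prompts for the image-level and pixel-level branches, with local evidence folded into the global score) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the fusion rule, so the max-over-patches term, the weight `alpha`, the softmax `temperature`, and all function names below are assumptions made purely for illustration, and the feature vectors are toy lists rather than TIPS embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anomaly_prob(feature, normal_text, abnormal_text, temperature=0.07):
    """Two-way softmax over cosine similarities; returns P(abnormal).

    The temperature value is an assumption, not taken from the paper.
    """
    s_normal = cosine(feature, normal_text) / temperature
    s_abnormal = cosine(feature, abnormal_text) / temperature
    m = max(s_normal, s_abnormal)  # subtract the max for numerical stability
    e_normal = math.exp(s_normal - m)
    e_abnormal = math.exp(s_abnormal - m)
    return e_abnormal / (e_normal + e_abnormal)


def score_image(global_feat, patch_feats, fixed_prompts, learned_prompts, alpha=0.5):
    """Return (image_score, pixel_map) for one image.

    fixed_prompts / learned_prompts are (normal, abnormal) text-embedding pairs:
    the fixed pair scores the global feature, the learned pair scores each patch.
    Fusing with alpha * max(pixel_map) is a hypothetical choice for this sketch.
    """
    pixel_map = [anomaly_prob(p, *learned_prompts) for p in patch_feats]
    global_score = anomaly_prob(global_feat, *fixed_prompts)
    image_score = (1 - alpha) * global_score + alpha * max(pixel_map)
    return image_score, pixel_map


# Toy example: the global feature looks normal, but one patch looks abnormal.
normal_dir, abnormal_dir = [1.0, 0.0], [0.0, 1.0]
image_score, pixel_map = score_image(
    global_feat=[1.0, 0.0],
    patch_feats=[[1.0, 0.0], [0.0, 1.0]],
    fixed_prompts=(normal_dir, abnormal_dir),
    learned_prompts=(normal_dir, abnormal_dir),
)
```

In this toy case the global branch alone would score the image as normal, while the fused score rises because one patch is strongly abnormal. That is the kind of effect "injecting local evidence into the global score" is meant to capture, whatever the exact rule used in the paper.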
