TIPS Over Tricks: Simple Prompts for Effective Zero-shot Anomaly Detection

3 February 2026

Alireza Salehi

Ehsan Karami

Sepehr Noey

Sahand Noey

Makoto Yamada

Reshad Hosseini

Mohammad Sabokrou

VLM

Main:3 Pages

8 Figures

Bibliography:2 Pages

10 Tables

Appendix:5 Pages

Abstract

Anomaly detection identifies departures from expected behavior in safety-critical settings. When target-domain normal data are unavailable, zero-shot anomaly detection (ZSAD) leverages vision-language models (VLMs). However, CLIP's coarse image-text alignment limits both localization and detection due to (i) spatial misalignment and (ii) weak sensitivity to fine-grained anomalies; prior works compensate with complex auxiliary modules yet largely overlook the choice of backbone. We revisit the backbone and use TIPS, a VLM trained with spatially aware objectives. While TIPS alleviates CLIP's issues, it exposes a distributional gap between global and local features. We address this with decoupled prompts (fixed for image-level detection and learnable for pixel-level localization) and by injecting local evidence into the global score. Without CLIP-specific tricks, our TIPS-based pipeline improves image-level performance by 1.1-3.9% and pixel-level by 1.5-6.9% across seven industrial datasets, delivering strong generalization with a lean architecture. Code is available at this http URL.
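
To make the scoring idea concrete, the following is a minimal sketch of how decoupled prompts and local-evidence injection might fit together. It is not the authors' implementation: the function names, the softmax temperature, the use of max pooling over patches, and the convex-combination fusion weight are all assumptions chosen for illustration, and the embeddings are toy vectors rather than TIPS outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def anomaly_prob(feature, normal_text, abnormal_text, temperature=0.07):
    """Softmax over similarities to a 'normal' and an 'abnormal' text
    embedding; returns the probability assigned to 'abnormal'.
    The temperature value is an assumption, not taken from the paper."""
    s_n = cosine(feature, normal_text) / temperature
    s_a = cosine(feature, abnormal_text) / temperature
    m = max(s_n, s_a)  # subtract the max for numerical stability
    e_n, e_a = math.exp(s_n - m), math.exp(s_a - m)
    return e_a / (e_n + e_a)

def score_image(global_feat, patch_feats, fixed_prompts, learned_prompts, weight=0.5):
    """Hypothetical scoring routine.

    fixed_prompts / learned_prompts are (normal, abnormal) embedding pairs:
    the fixed pair scores the global feature, the learned pair scores patches.
    Local evidence is injected as the max patch probability (an assumption).
    """
    global_score = anomaly_prob(global_feat, *fixed_prompts)
    anomaly_map = [anomaly_prob(p, *learned_prompts) for p in patch_feats]
    local_evidence = max(anomaly_map)
    image_score = (1 - weight) * global_score + weight * local_evidence
    return image_score, anomaly_map

# Toy example: 2-D embeddings, one clean patch and one anomalous patch.
fixed = ([1.0, 0.0], [0.0, 1.0])
learned = ([1.0, 0.0], [0.0, 1.0])
image_score, anomaly_map = score_image(
    global_feat=[1.0, 1.0],
    patch_feats=[[1.0, 0.0], [0.0, 1.0]],
    fixed_prompts=fixed,
    learned_prompts=learned,
)
```

In this toy case the global feature is ambiguous on its own, and the single anomalous patch raises the image-level score, which is the intuition behind injecting local evidence into the global score.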
