Vision Foundation Model Embedding-Based Semantic Anomaly Detection

Semantic anomalies are contextually invalid or unusual combinations of familiar visual elements that can cause undefined behavior and failures in system-level reasoning for autonomous systems. This work explores semantic anomaly detection by leveraging the semantic priors of state-of-the-art vision foundation models, operating directly on the image. We propose a framework that compares local vision embeddings from runtime images to a database of nominal scenarios in which the autonomous system is deemed safe and performant. In this work, we consider two variants of the proposed framework: one using raw grid-based embeddings, and another leveraging instance segmentation for object-centric representations. To further improve robustness, we introduce a simple filtering mechanism to suppress false positives. Our evaluations on CARLA-simulated anomalies show that the instance-based method with filtering achieves performance comparable to GPT-4o, while providing precise anomaly localization. These results highlight the potential utility of vision embeddings from foundation models for real-time anomaly detection in autonomous systems.
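The comparison of runtime embeddings against a nominal database can be illustrated with a minimal sketch. It assumes that each local embedding (a grid patch or a segmented instance) is a fixed-length vector, that similarity is cosine similarity, and that a patch is scored by its distance to the nearest nominal embedding. The function names, the threshold value, and the synthetic data are hypothetical choices for illustration, not the authors' implementation. The abstract does not specify the filtering mechanism, so it is not reproduced here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def anomaly_scores(runtime_emb, nominal_bank):
    """Score each runtime embedding as 1 - (max cosine similarity to the bank).

    runtime_emb:  (P, D) array, one row per local patch or instance.
    nominal_bank: (N, D) array of embeddings from nominal scenes.
    Returns a (P,) array; higher means less similar to anything nominal.
    """
    r = l2_normalize(np.asarray(runtime_emb, dtype=np.float64))
    n = l2_normalize(np.asarray(nominal_bank, dtype=np.float64))
    sims = r @ n.T  # (P, N) pairwise cosine similarities
    return 1.0 - sims.max(axis=1)


def flag_anomalies(scores, threshold=0.5):
    """Boolean mask of embeddings whose score exceeds a (hypothetical) threshold."""
    return scores > threshold


# Synthetic stand-in data (assumption): nominal embeddings occupy only the
# first 32 of 64 dimensions, so a vector in the remaining dimensions is
# orthogonal to every nominal embedding.
rng = np.random.default_rng(0)
nominal_bank = np.zeros((200, 64))
nominal_bank[:, :32] = rng.normal(size=(200, 32))

unseen = np.zeros(64)
unseen[40] = 1.0  # lies outside the nominal subspace

# Three rescaled copies of nominal embeddings plus one unseen embedding.
runtime_emb = np.vstack([nominal_bank[:3] * 2.0, unseen])

scores = anomaly_scores(runtime_emb, nominal_bank)
flags = flag_anomalies(scores)
print(scores.round(3), flags)
```

Because each score belongs to one local embedding, the flagged indices map back to image regions, which is what makes localization possible in this kind of scheme. With a real foundation model, the nominal bank would be far larger, and an approximate nearest-neighbor index would likely replace the dense matrix product.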
@article{ronecker2025_2505.07998,
  title={Vision Foundation Model Embedding-Based Semantic Anomaly Detection},
  author={Max Peter Ronecker and Matthew Foutter and Amine Elhafsi and Daniele Gammelli and Ihor Barakaiev and Marco Pavone and Daniel Watzenig},
  journal={arXiv preprint arXiv:2505.07998},
  year={2025}
}