Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs

31 July 2025

Ziqian Zhong

Aditi Raghunathan


Main: 10 Pages

9 Figures

Bibliography: 6 Pages

17 Tables

Appendix: 13 Pages

Abstract

The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution.
