arXiv:2508.00161

Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs

31 July 2025
Ziqian Zhong
Aditi Raghunathan
Main: 10 pages; 9 figures; Bibliography: 6 pages; 17 tables; Appendix: 13 pages
Abstract

The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution.
