
Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Main: 4 pages · 5 figures · Bibliography: 1 page · Appendix: 3 pages
Abstract

Large language models (LLMs) can internally distinguish between evaluation and deployment contexts, a behavior known as \emph{evaluation awareness}. This undermines AI safety evaluations, as models may conceal dangerous capabilities during testing. Prior work demonstrated this in a single 70B model, but the scaling relationship across model sizes remains unknown. We investigate evaluation awareness in 15 models from four families, ranging from 0.27B to 70B parameters, using linear probing on steering-vector activations. Our results reveal a clear power-law scaling: evaluation awareness increases predictably with model size. This scaling law enables forecasting of deceptive behavior in future, larger models and guides the design of scale-aware evaluation strategies for AI safety. An implementation of this paper is available at this https URL.
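The pipeline the abstract describes, a linear probe per model followed by a power-law fit across model sizes, can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's code: the function names, the use of scikit-learn's LogisticRegression as the linear probe, the log-log least-squares fit, and every number in the usage example are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_accuracy(train_acts, train_labels, test_acts, test_labels):
    """Fit a linear probe on one model's activations and return held-out accuracy.

    `train_acts`/`test_acts` would be activation vectors collected from the
    model (e.g. along a steering-vector direction); labels mark whether the
    prompt came from an evaluation or a deployment context.
    """
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_acts, train_labels)
    return probe.score(test_acts, test_labels)

def fit_power_law(param_counts, accuracies):
    """Fit accuracy ~ a * N**b by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(param_counts), np.log(accuracies), 1)
    return np.exp(intercept), slope  # (a, b)

# Illustrative usage with made-up numbers (not the paper's data):
sizes = np.array([0.27e9, 1e9, 7e9, 70e9])
accs = np.array([0.55, 0.62, 0.74, 0.88])
a, b = fit_power_law(sizes, accs)
print(f"probe accuracy ≈ {a:.3g} * N^{b:.3f}")
```

Fitting in log-log space turns the power law into a straight line, so the exponent `b` is simply the slope of a linear regression; an exponent near a stable positive value is what would make extrapolation to larger, untested models possible.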
