Forecasting Rare Language Model Behaviors
Standard language model evaluations can fail to capture risks that emerge only at deployment scale. For example, a model may produce safe responses during a small-scale beta test, yet reveal dangerous information when processing billions of requests at deployment. To remedy this, we introduce a method to forecast potential risks across orders of magnitude more queries than we can test during evaluation. We make forecasts by studying each query's elicitation probability -- the probability that the query produces a target behavior -- and demonstrate that the largest observed elicitation probabilities scale predictably with the number of queries. We find that our forecasts can predict the emergence of diverse undesirable behaviors -- such as assisting users with dangerous chemical synthesis or taking power-seeking actions -- across up to three orders of magnitude of query volume. Our work enables model developers to proactively anticipate and patch rare failures before they manifest during large-scale deployments.
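The abstract does not spell out the fitting procedure, but the core idea -- estimate each query's elicitation probability on a small evaluation set, then extrapolate the largest observed values to deployment-scale query counts -- can be sketched as follows. This is a minimal illustration assuming a Gumbel-like tail for the top elicitation probabilities; `sample_response`, `is_target`, and the specific tail fit are hypothetical stand-ins for the model API, behavior classifier, and the paper's actual forecasting procedure.

```python
import numpy as np

def estimate_elicitation_probs(queries, sample_response, is_target, k=100):
    """Monte Carlo estimate of each query's elicitation probability:
    the fraction of k sampled responses exhibiting the target behavior.
    `sample_response` and `is_target` are hypothetical callables standing
    in for the model API and a behavior classifier."""
    probs = []
    for q in queries:
        hits = sum(is_target(sample_response(q)) for _ in range(k))
        probs.append(hits / k)
    return np.array(probs)

def forecast_max_elicitation_prob(probs, n_eval, n_deploy, top_k=50):
    """Extrapolate the largest elicitation probability from n_eval observed
    queries to n_deploy deployment queries, assuming log(-log p) is roughly
    linear in the log tail quantile for the top-ranked queries."""
    top = np.sort(probs)[::-1][:top_k]
    top = np.clip(top, 1e-12, 1 - 1e-12)         # avoid log(0) and log(1)
    quantiles = np.arange(1, top_k + 1) / n_eval  # empirical tail quantiles
    y = np.log(-np.log(top))                      # Gumbel transform
    slope, intercept = np.polyfit(np.log(quantiles), y, 1)
    # Forecast the rank-1 quantile (1 / n_deploy) at deployment scale.
    y_deploy = slope * np.log(1.0 / n_deploy) + intercept
    return float(np.exp(-np.exp(y_deploy)))

# Example: forecast the worst-case elicitation probability when scaling
# from a 10k-query evaluation to 10M deployment queries.
# forecast = forecast_max_elicitation_prob(probs, n_eval=10_000,
#                                          n_deploy=10_000_000)
```

The Gumbel transform here is one conventional extreme-value choice: because deployment query counts exceed evaluation counts by orders of magnitude, the extrapolation happens in log-quantile space, where a linear tail fit is a standard (though assumed) model.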
@article{jones2025_2502.16797,
  title={Forecasting Rare Language Model Behaviors},
  author={Erik Jones and Meg Tong and Jesse Mu and Mohammed Mahfoud and Jan Leike and Roger Grosse and Jared Kaplan and William Fithian and Ethan Perez and Mrinank Sharma},
  journal={arXiv preprint arXiv:2502.16797},
  year={2025}
}