
Scaling Trends in Language Model Robustness

Abstract

Language models exhibit scaling laws, whereby increasing model and dataset size predictably decreases negative log-likelihood, unlocking a dazzling array of capabilities. At the same time, even the most capable systems remain vulnerable to adversarial inputs such as jailbreaks and prompt injections, despite concerted efforts to make them robust. As compute becomes more accessible to both attackers and defenders, which side will benefit more from scale? We attempt to answer this question with a detailed study of robustness across language models spanning three orders of magnitude in parameter count. From the defender's perspective, we find that in the absence of other interventions, increasing model size alone does not consistently improve robustness. In adversarial training, we find that larger models are more sample-efficient but less compute-efficient than smaller models, and that they often generalize their defense better to new threat models. From the attacker's perspective, we find that increasing attack compute smoothly and reliably increases attack success rate against both finetuned and adversarially trained models. Finally, we show that across the model sizes studied, doubling adversarial training compute forces the attacker to increase attack compute by less than a factor of two to maintain the same attack success rate. However, adversarial training becomes increasingly effective on larger models, suggesting that defenders could eventually have the advantage as model size grows. These results underscore the value of adopting a scaling lens when discussing the robustness of frontier models.
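To make the offense-defense tradeoff in the last finding concrete, the sketch below fits a power law relating defender compute (spent on adversarial training) to the attack compute needed to hold attack success rate fixed. This is a minimal illustration, not the paper's analysis: all numbers, array names, and the fitted exponent are hypothetical and chosen only to show how a fitted slope below one corresponds to "doubling defense forces less than a doubling of attack compute."

```python
# Illustrative sketch only: fit attack_compute ~ c * defense_compute**slope
# in log-log space. The data points below are invented for demonstration
# and do not come from the paper.
import numpy as np

# Hypothetical matched-attack-success-rate measurements:
# defender compute (FLOPs of adversarial training) vs. attacker compute
# (FLOPs of attack search) needed to keep attack success rate constant.
defense_compute = np.array([1e18, 2e18, 4e18, 8e18, 1.6e19])
attack_compute = np.array([3.0e15, 5.1e15, 8.7e15, 1.5e16, 2.5e16])

# Linear fit in log space gives the scaling exponent (slope) and offset.
slope, intercept = np.polyfit(np.log(defense_compute), np.log(attack_compute), 1)

# A slope below 1 means doubling defense compute multiplies the required
# attack compute by 2**slope < 2, the qualitative pattern the abstract describes.
print(f"fitted exponent: {slope:.2f}; "
      f"doubling defense multiplies required attack compute by {2**slope:.2f}x")
```

Whether the defender gains the advantage then depends on whether this exponent grows toward (and past) one as model size increases, which is the trend the abstract points to for larger models.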

@article{howe2025_2407.18213,
  title={Scaling Trends in Language Model Robustness},
  author={Nikolaus Howe and Ian McKenzie and Oskar Hollinsworth and Michał Zajac and Tom Tseng and Aaron Tucker and Pierre-Luc Bacon and Adam Gleave},
  journal={arXiv preprint arXiv:2407.18213},
  year={2025}
}