Stress-Testing Model Specs Reveals Character Differences among Language Models

Main: 15 pages, 14 figures, 3 tables; Bibliography: 3 pages; Appendix: 9 pages
Abstract
Large language models (LLMs) are increasingly trained to follow AI constitutions and model specifications that establish behavioral guidelines and ethical principles. However, these specifications face critical challenges, including internal conflicts between principles and insufficient coverage of nuanced scenarios. We present a systematic methodology for stress-testing model character specifications that automatically identifies numerous cases of principle contradictions and interpretive ambiguities in current model specs.
