
L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution

Abstract

Complex reasoning tasks often rely on the ability to consistently and accurately apply simple rules across incremental steps, a foundational capability which we term "level-0" reasoning. To systematically evaluate this capability, we introduce L0-Bench, a language model benchmark for testing procedural correctness -- the ability to generate correct reasoning processes, complementing existing benchmarks that primarily focus on outcome correctness. Given synthetic Python functions with simple operations, L0-Bench grades models on their ability to generate step-by-step, error-free execution traces. The synthetic nature of L0-Bench enables systematic and scalable generation of test programs along various axes (e.g., number of trace steps). We evaluate a diverse array of recent closed-source and open-weight models on a baseline test set. All models exhibit degradation as the number of target trace steps increases, while larger models and reasoning-enhanced models better maintain correctness over multiple steps. Additionally, we use L0-Bench to explore test-time scaling along three dimensions: input context length, number of solutions for majority voting, and inference steps. Our results suggest substantial room to improve "level-0" reasoning and potential directions to build more reliable reasoning systems.
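To make the task concrete, here is a minimal, hypothetical sketch (not taken from L0-Bench itself) of what such an evaluation item could look like: a synthetic Python function built from simple operations, a hand-written reference execution trace, and a strict step-by-step grader. The trace format, the function synthetic_program, and the grader grade_trace are illustrative assumptions rather than the benchmark's actual specification.

# Hypothetical sketch of an L0-Bench-style item (format is an assumption,
# not the benchmark's actual data layout).

def synthetic_program(x):
    # Simple operations only: assignment, arithmetic, a loop, a conditional.
    total = 0
    for i in range(3):
        if i % 2 == 0:
            total += x + i
        else:
            total -= i
    return total

# Reference trace for x = 5, one entry per state change (assumed format).
reference_trace = [
    "total = 0",
    "i = 0", "total = 5",   # even branch: total += 5 + 0
    "i = 1", "total = 4",   # odd branch:  total -= 1
    "i = 2", "total = 11",  # even branch: total += 5 + 2
    "return 11",
]

def grade_trace(model_trace, reference_trace):
    """Procedural correctness: every step must match; one wrong step fails the item."""
    if len(model_trace) != len(reference_trace):
        return False
    return all(m == r for m, r in zip(model_trace, reference_trace))

Under this strict matching scheme, a single incorrect intermediate value fails the whole item, which mirrors the paper's emphasis on error-free multi-step traces rather than final answers alone.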

@article{sun2025_2503.22832,
  title={L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution},
  author={Simeng Sun and Cheng-Ping Hsieh and Faisal Ladhak and Erik Arakelyan and Santiago Akle Serano and Boris Ginsburg},
  journal={arXiv preprint arXiv:2503.22832},
  year={2025}
}