
The ORCA Benchmark: Evaluating Real-World Calculation Accuracy in Large Language Models

Claudia Herambourg
Dawid Siuda
Julia Kopczyńska
Joao R. L. Santos
Wojciech Sas
Joanna Śmietańska-Nowak
22 pages, 6 figures, 6 tables
Abstract

We present the ORCA (Omni Research on Calculation in AI) Benchmark, a novel benchmark that evaluates large language models (LLMs) on multi-domain, real-life quantitative reasoning using verified outputs from Omni's calculator engine. Across 500 natural-language tasks spanning domains such as finance, physics, health, and statistics, five state-of-the-art systems (ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2) achieved only 45–63% accuracy, with errors mainly related to rounding (35%) and calculation mistakes (33%). Domain-specific results indicate strengths in mathematics and engineering but weaknesses in physics and the natural sciences. Correlation analysis (r ≈ 0.40–0.65) shows that the models often fail together yet differ in the types of errors they make, highlighting their partial complementarity rather than redundancy. Unlike standard math datasets, ORCA evaluates step-by-step reasoning, numerical precision, and domain generalization on real-world problems.
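
To make the correlation analysis concrete, here is a minimal sketch (not from the paper) of how pairwise failure correlations between models might be computed: each model is reduced to a binary correctness vector over the 500 tasks, and Pearson correlations are taken between these vectors. The random placeholder data and the exact scoring scheme are assumptions for illustration only.

```python
import numpy as np

# Hypothetical setup: rows = models, columns = the 500 ORCA tasks;
# an entry is 1 if the model's answer matched the verified calculator
# output, 0 otherwise. Random data stands in for real benchmark results.
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=(5, 500))

models = ["ChatGPT-5", "Gemini 2.5 Flash", "Claude Sonnet 4.5",
          "Grok 4", "DeepSeek V3.2"]

# Pearson correlation between each pair of binary correctness vectors;
# values near 1 mean two models tend to fail on the same tasks.
r = np.corrcoef(correct)
for i in range(len(models)):
    for j in range(i + 1, len(models)):
        print(f"{models[i]} vs {models[j]}: r = {r[i, j]:+.2f}")
```

Under this reading, moderate correlations (such as the reported r ≈ 0.40–0.65) indicate shared hard tasks, while the remaining disagreement reflects model-specific error types.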
