
ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions

Aayush Gupta
Main: 12 pages · 6 figures · 10 tables · Bibliography: 3 pages · Appendix: 3 pages
Abstract

Existing benchmarks for tool-using LLM agents primarily report single-run success rates and miss reliability properties required in production. We introduce ReliabilityBench, a benchmark for evaluating agent reliability across three dimensions: (i) consistency under repeated execution using pass^k, (ii) robustness to semantically equivalent task perturbations at intensity ε, and (iii) fault tolerance under controlled tool/API failures at intensity λ. ReliabilityBench contributes a unified reliability surface R(k, ε, λ), action metamorphic relations that define correctness via end-state equivalence rather than text similarity, and a chaos-engineering-style fault-injection framework (timeouts, rate limits, partial responses, schema drift). We evaluate two models (Gemini 2.0 Flash, GPT-4o) and two agent architectures (ReAct, Reflexion) across four domains (scheduling, travel, customer support, e-commerce) over 1,280 episodes. Perturbations alone reduce success from 96.9% at ε = 0 to 88.1% at ε = 0.2. Rate limiting is the most damaging fault in ablations. ReAct is more robust than Reflexion under combined stress, and Gemini 2.0 Flash achieves reliability comparable to GPT-4o at much lower cost. ReliabilityBench provides a systematic framework for assessing the production readiness of LLM agents.
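The pass^k consistency metric measures the probability that all k independent runs of a task succeed, in contrast to pass@k (at least one of k succeeds). The abstract does not give the estimator used, but a common unbiased form, analogous to the pass@k estimator, computes it from c successes out of n trials. A minimal sketch under that assumption (the function name and signature are illustrative, not from the paper):

```python
from math import comb

def pass_power_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass^k: the probability that k
    independent runs of a task all succeed, given that c of
    n observed trials succeeded (assumed estimator, k <= n).

    Derivation: of the C(n, k) ways to draw k trials from the
    n observed ones, exactly C(c, k) draws are all successes.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb(c, k) is 0 when k > c, so the estimate is 0
    # whenever there are fewer successes than required runs.
    return comb(c, k) / comb(n, k)
```

For example, with 2 successes in 4 trials, pass^2 = C(2,2)/C(4,2) = 1/6, much lower than the single-run rate of 1/2, which is exactly the gap between single-run success and repeated-execution consistency that the benchmark targets.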
