Establishing Best Practices for Building Rigorous Agentic Benchmarks

Yuxuan Zhu
Tengjun Jin
Yada Pruksachatkun
Andy Zhang
Shu Liu
Sasha Cui
Sayash Kapoor
Shayne Longpre
Kevin Meng
Rebecca Weiss
Fazl Barez
Rahul Gupta
Jwala Dhamala
Jacob Merizian
Mario Giulianelli
Harry Coppock
Cozmin Ududec
Jasjeet Sekhon
Jacob Steinhardt
Antony Kellerman
Sarah Schwettmann
Matei Zaharia
Ion Stoica
Percy Liang
Daniel Kang
Main: 9 pages, 7 figures
Bibliography: 6 pages
Appendix: 24 pages, 26 tables
Abstract

Benchmarks are essential for quantitatively tracking progress in AI. As AI agents become increasingly capable, researchers and practitioners have introduced agentic benchmarks to evaluate agents on complex, real-world tasks. These benchmarks typically measure agent capabilities by evaluating task outcomes via specific reward designs. However, we show that many agentic benchmarks have issues in task setup or reward design. For example, SWE-bench Verified uses insufficient test cases, while TAU-bench counts empty responses as successful. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVE-Bench, a benchmark with a particularly complex evaluation design, ABC reduces the performance overestimation by 33%.

@article{zhu2025_2507.02825,
  title={Establishing Best Practices for Building Rigorous Agentic Benchmarks},
  author={Yuxuan Zhu and Tengjun Jin and Yada Pruksachatkun and Andy Zhang and Shu Liu and Sasha Cui and Sayash Kapoor and Shayne Longpre and Kevin Meng and Rebecca Weiss and Fazl Barez and Rahul Gupta and Jwala Dhamala and Jacob Merizian and Mario Giulianelli and Harry Coppock and Cozmin Ududec and Jasjeet Sekhon and Jacob Steinhardt and Antony Kellerman and Sarah Schwettmann and Matei Zaharia and Ion Stoica and Percy Liang and Daniel Kang},
  journal={arXiv preprint arXiv:2507.02825},
  year={2025}
}