2505.05602
Cited By
HiBayES: A Hierarchical Bayesian Modeling Framework for AI Evaluation Statistics
8 May 2025
Lennart Luettgau
Harry Coppock
Magda Dubois
Christopher Summerfield
Cozmin Ududec
Papers citing "HiBayES: A Hierarchical Bayesian Modeling Framework for AI Evaluation Statistics"
6 / 6 papers shown
Measuring what Matters: Construct Validity in Large Language Model Benchmarks
Andrew M. Bean
Ryan Kearns
Angelika Romanou
Franziska Sofia Hafner
Harry Mayne
...
Christopher Summerfield
Philip Torr
Cozmin Ududec
Luc Rocher
Adam Mahdi
ALM
369
2
0
03 Nov 2025
HIP-LLM: A Hierarchical Imprecise Probability Approach to Reliability Assessment of Large Language Models
Robab Aghazadeh-Chakherlou
Qing Guo
Siddartha Khastgir
Peter Popov
Xiaoge Zhang
Xingyu Zhao
113
0
0
01 Nov 2025
Do Repetitions Matter? Strengthening Reliability in LLM Evaluations
Miguel Angel Alvarado Gonzalez
Michelle Bruno Hernandez
Miguel Angel Peñaloza Perez
Bruno Lopez Orozco
Jesus Tadeo Cruz Soto
Sandra Malagon
ALM
80
0
0
28 Sep 2025
Measuring AI Ability to Complete Long Tasks
Thomas Kwa
Ben West
Joel Becker
Amy Deng
Katharyn Garcia
...
Lucas Jun Koba Sato
H. Wijk
Daniel M. Ziegler
Elizabeth Barnes
Lawrence Chan
ELM
473
68
0
18 Mar 2025
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
Zhaojian Yu
Yilun Zhao
Arman Cohan
Jinqiang Cui
LRM
239
23
0
03 Jan 2025
Inferring Capabilities from Task Performance with Bayesian Triangulation
John Burden
Konstantinos Voudouris
Ryan Burnell
Danaja Rutar
Lucy G. Cheke
José Hernández-Orallo
124
10
0
21 Sep 2023