2410.03492
Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores
4 October 2024
Robert E Blackwell
Jon Barry
Anthony G Cohn
UQCV
Papers citing "Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores"
10 / 10 papers shown
Steering in the Shadows: Causal Amplification for Activation Space Attacks in Large Language Models
Zhiyuan Xu
Stanislav Abaimov
Joseph Gardiner
Sana Belguith
LLMSV
256
0
0
21 Nov 2025
Critical Confabulation: Can LLMs Hallucinate for Social Good?
Peiqi Sui
Eamon Duede
Hoyt Long
Richard Jean So
128
1
0
11 Nov 2025
Reflections on the Reproducibility of Commercial LLM Performance in Empirical Software Engineering Studies
Florian Angermeir
Maximilian Amougou
Mark Kreitz
Andreas Bauer
Matthias Linhuber
Davide Fucci
Fabiola Moyón C.
Daniel Méndez
T. Gorschek
225
8
0
29 Oct 2025
Investigating LLM Variability in Personalized Conversational Information Retrieval
Simon Lupart
Daniël van Dijk
Eric Langezaal
Ian van Dort
Mohammad Aliannejadi
138
2
0
04 Oct 2025
Beyond statistical significance: Quantifying uncertainty and statistical variability in multilingual and multitask NLP evaluation
Jonne Sälevä
Duygu Ataman
Constantine Lignos
197
0
0
26 Sep 2025
From Queries to Criteria: Understanding How Astronomers Evaluate LLMs
Alina Hyk
Kiera McCormick
Mian Zhong
I. Ciucă
Sanjib Sharma
John F. Wu
J. E. G. Peek
K. Iyer
Ziang Xiao
Anjalie Field
216
4
0
21 Jul 2025
Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited
Anthony G Cohn
Robert E Blackwell
LRM
ELM
187
2
0
16 Jul 2025
Foundation Models for Geospatial Reasoning: Assessing Capabilities of Large Language Models in Understanding Geometries and Topological Spatial Relations
International Journal of Geographical Information Science (IJGIS), 2025
Yuhan Ji
Song Gao
Ying Nie
Ivan Majic
K. Janowicz
ReLM
LRM
590
29
0
22 May 2025
Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges
Xiao Xiao
Yu Su
Sijing Zhang
Zhang Chen
Yadong Chen
Tian Liu
315
7
0
30 Apr 2025
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Terry Yue Zhuo
Minh Chien Vu
Jenny Chim
Han Hu
Wenhao Yu
...
David Lo
Daniel Fried
Xiaoning Du
H. D. Vries
Leandro von Werra
796
468
0
22 Jun 2024