2410.03492
Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores


4 October 2024

Robert E Blackwell

Papers citing "Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores"

Steering in the Shadows: Causal Amplification for Activation Space Attacks in Large Language Models

Stanislav Abaimov

Joseph Gardiner

256

0

0

21 Nov 2025

Critical Confabulation: Can LLMs Hallucinate for Social Good?

Richard Jean So

128

1

0

11 Nov 2025

Reflections on the Reproducibility of Commercial LLM Performance in Empirical Software Engineering Studies

Florian Angermeir

Maximilian Amougou

Matthias Linhuber

Fabiola Moyón C.

225

8

0

29 Oct 2025

Investigating LLM Variability in Personalized Conversational Information Retrieval

Daniël van Dijk

Mohammad Aliannejadi

138

2

0

04 Oct 2025

Beyond statistical significance: Quantifying uncertainty and statistical variability in multilingual and multitask NLP evaluation

Constantine Lignos

197

0

0

26 Sep 2025

From Queries to Criteria: Understanding How Astronomers Evaluate LLMs

Kiera McCormick

216

4

0

21 Jul 2025

Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited

Robert E Blackwell

187

2

0

16 Jul 2025

Foundation Models for Geospatial Reasoning: Assessing Capabilities of Large Language Models in Understanding Geometries and Topological Spatial Relations

International Journal of Geographical Information Science (IJGIS), 2025

590

29

0

22 May 2025

Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges

315

7

0

30 Apr 2025

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Wenhao Yu

...

David Lo

Xiaoning Du

Leandro von Werra

796

468

0

22 Jun 2024
