
Do Large Language Model Benchmarks Test Reliability?
arXiv:2502.03461 · 5 February 2025
Joshua Vendrow, Edward Vendrow, Sara Beery, Aleksander Madry

Papers citing "Do Large Language Model Benchmarks Test Reliability?"

11 / 11 papers shown
EvoLM: In Search of Lost Language Model Training Dynamics
Zhenting Qi, Fan Nie, Alexandre Alahi, James Zou, Himabindu Lakkaraju, Yilun Du, Eric P. Xing, Sham Kakade, Hanlin Zhang
19 Jun 2025
Min-p, Max Exaggeration: A Critical Analysis of Min-p Sampling in Language Models
Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch
16 Jun 2025
Domain Specific Benchmarks for Evaluating Multimodal Large Language Models
Khizar Anjuma, Muhammad Arbab Arshad, Kadhim Hayawi, Efstathios Polyzos, A. Tariq, ..., Nishith Reddy Mannuru, Ravi Varma Kumar Bevara, Taslim Mahbub, Muhammad Zeeshan Akram, Sakib Shahriar
ELM, LRM
15 Jun 2025
Learning a Continue-Thinking Token for Enhanced Test-Time Scaling
Liran Ringel, Elad Tolochinsky, Yaniv Romano
LRM
12 Jun 2025
Benchmarking Misuse Mitigation Against Covert Adversaries
Davis Brown, Mahdi Sabbaghi, Luze Sun, Alexander Robey, George Pappas, Eric Wong, Hamed Hassani
06 Jun 2025
How Can I Publish My LLM Benchmark Without Giving the True Answers Away?
Takashi Ishida, Thanawat Lodkaew, Ikko Yamane
23 May 2025
Evaluating LLM Metrics Through Real-World Capabilities
Justin K Miller, Wenjia Tang
ELM, ALM
13 May 2025
What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks
Pavel Chizhov, Mattia Nee, Pierre-Carl Langlais, Ivan P. Yamshchikov
ReLM, ELM, LRM
10 Apr 2025
SWI: Speaking with Intent in Large Language Models
Yuwei Yin, EunJeong Hwang, Giuseppe Carenini
LRM
27 Mar 2025
À la recherche du sens perdu: your favourite LLM might have more to say than you can understand
K. O. T. Erziev
28 Feb 2025
MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
Akshat Sharma, Hangliang Ding, Jianping Li, Neel Dani, Minjia Zhang
27 Nov 2024