How predictable is language model benchmark performance?

9 January 2024

Papers citing "How predictable is language model benchmark performance?"

16 / 16 papers shown

Title
Can a Crow Hatch a Falcon? Lineage Matters in Predicting Large Language Model Performance Takuya Tamura Taro Yano Masafumi Enomoto M. Oyamada 39 0 0 28 Apr 2025
Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark Jasper Götting Pedro Medeiros Jon G Sanders Nathaniel Li Long Phan Karam Elabd Lennart Justen Dan Hendrycks Seth Donoughe ELM 49 2 0 21 Apr 2025
Measuring AI Ability to Complete Long Tasks Thomas Kwa Ben West Joel Becker Amy Deng Katharyn Garcia ... Lucas Jun Koba Sato H. Wijk Daniel M. Ziegler Elizabeth Barnes Lawrence Chan ELM 75 6 0 18 Mar 2025
Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective Chengyin Xu Kaiyuan Chen Xiao Li Ke Shen Chenggang Li OffRL 41 0 0 24 Feb 2025
Forecasting Frontier Language Model Agent Capabilities Govind Pimpale Axel Højmark Jérémy Scheurer Marius Hobbhahn LLMAG ELM 41 1 0 21 Feb 2025
Predictable Artificial Intelligence Lexin Zhou Pablo Antonio Moreno Casares Fernando Martínez-Plumed John Burden Ryan Burnell ... Seán Ó hÉigeartaigh Danaja Rutar Wout Schellaert Konstantinos Voudouris José Hernández Orallo 41 2 0 08 Jan 2025
Sloth: scaling laws for LLM skills to predict multi-benchmark performance across families Felipe Maia Polo S. Kamath S Leshem Choshen Yuekai Sun Mikhail Yurochkin 82 5 0 09 Dec 2024
Predicting Emergent Capabilities by Finetuning Charlie Snell Eric Wallace Dan Klein Sergey Levine ELM LRM 75 5 0 25 Nov 2024
A Hitchhiker's Guide to Scaling Law Estimation Leshem Choshen Yang Zhang Jacob Andreas 41 6 0 15 Oct 2024
U-shaped and Inverted-U Scaling behind Emergent Abilities of Large Language Models Tung-Yu Wu Pei-Yu Lo ReLM LRM 40 2 0 02 Oct 2024
ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities Ezra Karger Houtan Bastani Chen Yueh-Han Zachary Jacobs Danny Halawi Fred Zhang P. Tetlock 33 6 0 30 Sep 2024
Improving Pretraining Data Using Perplexity Correlations Tristan Thrush Christopher Potts Tatsunori Hashimoto 32 17 0 09 Sep 2024
100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances Lorenzo Pacchiardi Lucy G. Cheke José Hernández Orallo ALM LRM ELM 36 3 0 05 Sep 2024
Performance Law of Large Language Models Chuhan Wu Ruiming Tang LRM 38 2 0 19 Aug 2024
Collaborative Performance Prediction for Large Language Models Qiyuan Zhang Fuyuan Lyu Xue Liu Chen Ma 23 3 0 01 Jul 2024
Revisiting Neural Scaling Laws in Language and Vision Ibrahim M. Alabdulmohsin Behnam Neyshabur Xiaohua Zhai 148 101 0 13 Sep 2022