Pretraining on the Test Set Is All You Need
Rylan Schaeffer
arXiv:2309.08632 · 13 September 2023
Papers citing "Pretraining on the Test Set Is All You Need" (18 papers)
1. Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts (06 Nov 2025)
   Ellis L Brown, Jihan Yang, Shusheng Yang, Rob Fergus, Saining Xie
   Tags: VLM

2. Efficient Prediction of Pass@k Scaling in Large Language Models (06 Oct 2025)
   Joshua Kazdan, Rylan Schaeffer, Youssef Allouah, Colin Sullivan, Kyssen Yu, Noam Levi, Sanmi Koyejo
   Tags: OffRL

3. Finetune Once: Decoupling General & Domain Learning with Dynamic Boosted Annealing (30 Sep 2025)
   Yang Tang, Ruijie Liu, Yifan Wang, Shiyu Li, Xi Chen

4. Evaluating the Robustness of Chinchilla Compute-Optimal Scaling (28 Sep 2025)
   Rylan Schaeffer, Noam Levi, Andreas Kirsch, Theo Guenais, Brando Miranda, Elyas Obbad, Sanmi Koyejo
   Tags: LRM

5. Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs (05 Aug 2025)
   Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno Dumont, Elyas Obbad, Sanmi Koyejo
   Tags: AIMat, ReLM, LRM

6. Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks (24 Feb 2025)
   Rylan Schaeffer, Punit Singh Koura, Binh Tang, R. Subramanian, Aaditya K. Singh, ..., Vedanuj Goswami, Sergey Edunov, Dieuwke Hupkes, Sanmi Koyejo, Sharan Narang
   Tags: ALM

7. Understand User Opinions of Large Language Models via LLM-Powered In-the-Moment User Experience Interviews (21 Feb 2025)
   Annual Meeting of the Association for Computational Linguistics (ACL), 2025
   Mengqiao Liu, Tevin Wang, Cassandra A. Cohen, Sarah Li, Chenyan Xiong
   Tags: LRM

8. Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation (20 Jun 2024)
   Chunyuan Deng, Yilun Zhao, Yuzhao Heng, Yitong Li, Jiannan Cao, Xiangru Tang, Arman Cohan

9. AI Sandbagging: Language Models can Strategically Underperform on Evaluations (11 Jun 2024)
   Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
   Tags: ELM

10. Kotlin ML Pack: Technical Report (29 May 2024)
    Sergey Titov, Mikhail Evtikhiev, Anton Shapkin, Oleg Smirnov, Sergei Boytsov, ..., Dariia Karaeva, Maksim Sheptyakov, Mikhail Arkhipov, T. Bryksin, Egor Bogomolov

11. EnviroExam: Benchmarking Environmental Science Knowledge of Large Language Models (18 May 2024)
    Yu Huang, Liang Guo, Wanqian Guo, Zhe Tao, Yang Lv, Zhihao Sun, Dongfang Zhao
    Tags: ELM

12. Chameleon: Mixed-Modal Early-Fusion Foundation Models (16 May 2024)
    Chameleon Team
    Tags: MLLM

13. Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models (03 May 2024)
    Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, ..., Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Mikel Artetxe, Yi Tay
    Tags: VLM

14. Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition (10 Apr 2024)
    Kehua Feng, Keyan Ding, Hongzhi Tan, Kede Ma, Zhihua Wang, ..., Yuzhou Cheng, Ge Sun, Guozhou Zheng, Qiang Zhang, H. Chen

15. Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models (19 Feb 2024)
    Jiahao Ying, Yixin Cao, Yushi Bai, Qianru Sun, Bo Wang, Wei Tang, Zhaojun Ding, Yizhe Yang, Xuanjing Huang, Shuicheng Yan
    Tags: KELM

16. When Large Language Models Meet Vector Databases: A Survey (30 Jan 2024)
    Zhi Jing, Yongye Su, Yikun Han, Bo Yuan, Haiyun Xu, Chunjiang Liu, Kehai Chen, Min Zhang

17. Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models (31 Oct 2023)
    Tian Liang, Zhiwei He, Shu Yang, Wenxuan Wang, Wenxiang Jiao, Rui Wang, Yujiu Yang, Zhaopeng Tu, Shuming Shi, Xing Wang
    Tags: LLMAG

18. LawBench: Benchmarking Legal Knowledge of Large Language Models (28 Sep 2023)
    Zhiwei Fei, Xiaoyu Shen, D. Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai-xiang Chen, Zongwen Shen, Jidong Ge
    Tags: ELM, AILaw