Pretraining on the Test Set Is All You Need
Rylan Schaeffer
arXiv:2309.08632 · 13 September 2023
Papers citing "Pretraining on the Test Set Is All You Need" (18 papers)
1. Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts (06 Nov 2025)
   Ellis L Brown, Jihan Yang, Shusheng Yang, Rob Fergus, Saining Xie
   Tags: VLM

2. Efficient Prediction of Pass@k Scaling in Large Language Models (06 Oct 2025)
   Joshua Kazdan, Rylan Schaeffer, Youssef Allouah, Colin Sullivan, Kyssen Yu, Noam Levi, Sanmi Koyejo
   Tags: OffRL

3. Finetune Once: Decoupling General & Domain Learning with Dynamic Boosted Annealing (30 Sep 2025)
   Yang Tang, Ruijie Liu, Yifan Wang, Shiyu Li, Xi Chen

4. Evaluating the Robustness of Chinchilla Compute-Optimal Scaling (28 Sep 2025)
   Rylan Schaeffer, Noam Levi, Andreas Kirsch, Theo Guenais, Brando Miranda, Elyas Obbad, Sanmi Koyejo
   Tags: LRM

5. Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs (05 Aug 2025)
   Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno Dumont, Elyas Obbad, Sanmi Koyejo
   Tags: AIMat, ReLM, LRM

6. Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks (24 Feb 2025)
   Rylan Schaeffer, Punit Singh Koura, Binh Tang, R. Subramanian, Aaditya K. Singh, ..., Vedanuj Goswami, Sergey Edunov, Dieuwke Hupkes, Sanmi Koyejo, Sharan Narang
   Tags: ALM

7. Understand User Opinions of Large Language Models via LLM-Powered In-the-Moment User Experience Interviews (21 Feb 2025)
   Annual Meeting of the Association for Computational Linguistics (ACL), 2025
   Mengqiao Liu, Tevin Wang, Cassandra A. Cohen, Sarah Li, Chenyan Xiong
   Tags: LRM

8. Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation (20 Jun 2024)
   Chunyuan Deng, Yilun Zhao, Yuzhao Heng, Yitong Li, Jiannan Cao, Xiangru Tang, Arman Cohan

9. AI Sandbagging: Language Models can Strategically Underperform on Evaluations (11 Jun 2024)
   Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
   Tags: ELM

10. Kotlin ML Pack: Technical Report (29 May 2024)
    Sergey Titov, Mikhail Evtikhiev, Anton Shapkin, Oleg Smirnov, Sergei Boytsov, ..., Dariia Karaeva, Maksim Sheptyakov, Mikhail Arkhipov, T. Bryksin, Egor Bogomolov

11. EnviroExam: Benchmarking Environmental Science Knowledge of Large Language Models (18 May 2024)
    Yu Huang, Liang Guo, Wanqian Guo, Zhe Tao, Yang Lv, Zhihao Sun, Dongfang Zhao
    Tags: ELM

12. Chameleon: Mixed-Modal Early-Fusion Foundation Models (16 May 2024)
    Chameleon Team
    Tags: MLLM

13. Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models (03 May 2024)
    Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, ..., Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Mikel Artetxe, Yi Tay
    Tags: VLM

14. Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition (10 Apr 2024)
    Kehua Feng, Keyan Ding, Hongzhi Tan, Kede Ma, Zhihua Wang, ..., Yuzhou Cheng, Ge Sun, Guozhou Zheng, Qiang Zhang, H. Chen

15. Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models (19 Feb 2024)
    Jiahao Ying, Yixin Cao, Yushi Bai, Qianru Sun, Bo Wang, Wei Tang, Zhaojun Ding, Yizhe Yang, Xuanjing Huang, Shuicheng Yan
    Tags: KELM

16. When Large Language Models Meet Vector Databases: A Survey (30 Jan 2024)
    Zhi Jing, Yongye Su, Yikun Han, Bo Yuan, Haiyun Xu, Chunjiang Liu, Kehai Chen, Min Zhang

17. Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models (31 Oct 2023)
    Tian Liang, Zhiwei He, Shu Yang, Wenxuan Wang, Wenxiang Jiao, Rui Wang, Yujiu Yang, Zhaopeng Tu, Shuming Shi, Xing Wang
    Tags: LLMAG

18. LawBench: Benchmarking Legal Knowledge of Large Language Models (28 Sep 2023)
    Zhiwei Fei, Xiaoyu Shen, D. Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai-xiang Chen, Zongwen Shen, Jidong Ge
    Tags: ELM, AILaw