The Benchmark Lottery
arXiv:2107.07002

14 July 2021
Mostafa Dehghani
Yi Tay
A. Gritsenko
Zhe Zhao
N. Houlsby
Fernando Diaz
Donald Metzler
Oriol Vinyals

Papers citing "The Benchmark Lottery"

50 / 65 papers shown
A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility
Andreas Hochlehnert
Hardik Bhatnagar
Vishaal Udandarao
Samuel Albanie
Ameya Prabhu
Matthias Bethge
ReLM
ALM
LRM
77
4
0
09 Apr 2025
RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code
Dhruv Gautam
Spandan Garg
Jinu Jang
Neel Sundaresan
Roshanak Zilouchian Moghaddam
LLMAG
LRM
67
2
0
10 Mar 2025
Self-supervised Benchmark Lottery on ImageNet: Do Marginal Improvements Translate to Improvements on Similar Datasets?
Utku Ozbulak
Esla Timothy Anzaku
Solha Kang
W. D. Neve
J. Vankerschaver
50
0
0
28 Jan 2025
Predictable Artificial Intelligence
Lexin Zhou
Pablo Antonio Moreno Casares
Fernando Martínez-Plumed
John Burden
Ryan Burnell
...
Seán Ó hÉigeartaigh
Danaja Rutar
Wout Schellaert
Konstantinos Voudouris
José Hernández Orallo
44
2
0
08 Jan 2025
Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards
Varvara Arzt
Allan Hanbury
31
1
0
07 Nov 2024
Benchmark Data Repositories for Better Benchmarking
Rachel Longjohn
Markelle Kelly
Sameer Singh
Padhraic Smyth
29
0
0
31 Oct 2024
Molecular Topological Profile (MOLTOP) -- Simple and Strong Baseline for Molecular Graph Classification
Jakub Adamczyk
Wojciech Czech
35
2
0
16 Jul 2024
Composable Interventions for Language Models
Arinbjorn Kolbeinsson
Kyle O'Brien
Tianjin Huang
Shanghua Gao
Shiwei Liu
...
Anurag J. Vaidya
Faisal Mahmood
Marinka Zitnik
Tianlong Chen
Thomas Hartvigsen
KELM
MU
80
5
0
09 Jul 2024
Generalizability of experimental studies
Federico Matteucci
Vadim Arzamasov
Jose Cribeiro-Ramallo
Marco Heyden
Konstantin Ntounas
Klemens Böhm
40
0
0
25 Jun 2024
Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework
Olivier Binette
Jerome P. Reiter
28
0
0
14 Jun 2024
Quantifying Variance in Evaluation Benchmarks
Lovish Madaan
Aaditya K. Singh
Rylan Schaeffer
Andrew Poulton
Sanmi Koyejo
Pontus Stenetorp
Sharan Narang
Dieuwke Hupkes
33
9
0
14 Jun 2024
Lessons from the Trenches on Reproducible Evaluation of Language Models
Stella Biderman
Hailey Schoelkopf
Lintang Sutawika
Leo Gao
J. Tow
...
Xiangru Tang
Kevin A. Wang
Genta Indra Winata
François Yvon
Andy Zou
ELM
ALM
130
52
3
23 May 2024
Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
Piotr Padlewski
Max Bain
Matthew Henderson
Zhongkai Zhu
Nishant Relan
...
Che Zheng
Cyprien de Masson d'Autume
Dani Yogatama
Mikel Artetxe
Yi Tay
VLM
82
26
0
03 May 2024
Position: Why We Must Rethink Empirical Research in Machine Learning
Moritz Herrmann
F. J. D. Lange
Katharina Eggensperger
Giuseppe Casalicchio
Marcel Wever
Matthias Feurer
David Rügamer
Eyke Hüllermeier
A. Boulesteix
Bernd Bischl
39
6
0
03 May 2024
Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks
Guanhua Zhang
Moritz Hardt
40
7
0
02 May 2024
IID Relaxation by Logical Expressivity: A Research Agenda for Fitting Logics to Neurosymbolic Requirements
M. Stol
Alessandra Mileo
21
1
0
30 Apr 2024
Better than classical? The subtle art of benchmarking quantum machine learning models
Joseph Bowles
Shahnawaz Ahmed
Maria Schuld
30
62
0
11 Mar 2024
When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
Norah A. Alzahrani
H. A. Alyahya
Sultan Yazeed Alnumay
Muhtasim Tahmid
Shaykhah Alsubaie
...
Saleh Soltan
Nathan Scales
Marie-Anne Lachaux
Samuel R. Bowman
Haidar Khan
ELM
15
69
0
01 Feb 2024
Perturbed examples reveal invariances shared by language models
Ruchit Rawal
Mariya Toneva
AAML
31
0
0
07 Nov 2023
Unleashing the potential of prompt engineering in Large Language Models: a comprehensive review
Banghao Chen
Zhaofeng Zhang
Nicolas Langrené
Shengxin Zhu
LLMAG
19
89
0
23 Oct 2023
A benchmark of categorical encoders for binary classification
Federico Matteucci
Vadim Arzamasov
Klemens Boehm
ELM
19
4
0
17 Jul 2023
Search-Based Regular Expression Inference on a GPU
Mojtaba Valizadeh
Martin Berger
15
9
0
29 May 2023
On Degrees of Freedom in Defining and Testing Natural Language Understanding
Saku Sugawara
S. Tsugita
ELM
21
1
0
24 May 2023
Active Learning Principles for In-Context Learning with Large Language Models
Katerina Margatina
Timo Schick
Nikolaos Aletras
Jane Dwivedi-Yu
20
39
0
23 May 2023
A benchmark for computational analysis of animal behavior, using animal-borne tags
Benjamin Hoffman
M. Cusimano
V. Baglione
D. Canestrari
D. Chevallier
...
O. Vainio
A. Vehkaoja
Ken Yoda
Katie Zacarian
A. Friedlaender
23
7
0
18 May 2023
Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks
Anas Himmi
Ekhine Irurozki
Nathan Noiry
Stéphan Clémençon
Pierre Colombo
19
5
0
17 May 2023
It Takes Two to Tango: Navigating Conceptualizations of NLP Tasks and Measurements of Performance
Arjun Subramonian
Xingdi Yuan
Hal Daumé
Su Lin Blodgett
27
17
0
15 May 2023
Can Fairness be Automated? Guidelines and Opportunities for Fairness-aware AutoML
Hilde J. P. Weerts
Florian Pfisterer
Matthias Feurer
Katharina Eggensperger
Eddie Bergman
Noor H. Awad
Joaquin Vanschoren
Mykola Pechenizkiy
B. Bischl
Frank Hutter
FaML
31
17
0
15 Mar 2023
Scaling Vision Transformers to 22 Billion Parameters
Mostafa Dehghani
Josip Djolonga
Basil Mustafa
Piotr Padlewski
Jonathan Heek
...
Mario Lučić
Xiaohua Zhai
Daniel Keysers
Jeremiah Harmsen
N. Houlsby
MLLM
61
569
0
10 Feb 2023
Toward a Theory of Causation for Interpreting Neural Code Models
David Nader-Palacio
Alejandro Velasco
Nathan Cooper
Á. Rodríguez
Kevin Moran
Denys Poshyvanyk
13
16
0
07 Feb 2023
Adaptive Computation with Elastic Input Sequence
Fuzhao Xue
Valerii Likhosherstov
Anurag Arnab
N. Houlsby
Mostafa Dehghani
Yang You
27
18
0
30 Jan 2023
Evaluation for Change
Rishi Bommasani
ELM
21
0
0
20 Dec 2022
BEANS: The Benchmark of Animal Sounds
Masato Hagiwara
Benjamin Hoffman
Jen-Yu Liu
M. Cusimano
Felix Effenberger
Katie Zacarian
37
25
0
21 Oct 2022
Vote'n'Rank: Revision of Benchmarking with Social Choice Theory
Mark Rofin
Vladislav Mikhailov
Mikhail Florinskiy
A. Kravchenko
E. Tutubalina
Tatiana Shavrina
Daniel Karabekyan
Ekaterina Artemova
22
6
0
11 Oct 2022
Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling
Haw-Shiuan Chang
Ruei-Yao Sun
Kathryn Ricci
Andrew McCallum
27
14
0
10 Oct 2022
Making Intelligence: Ethical Values in IQ and ML Benchmarks
Borhane Blili-Hamelin
Leif Hancox-Li
17
16
0
01 Sep 2022
Gaussian Process Surrogate Models for Neural Networks
Michael Y. Li
Erin Grant
Thomas L. Griffiths
BDL
SyDa
23
7
0
11 Aug 2022
Improving Predictive Performance and Calibration by Weight Fusion in Semantic Segmentation
Timo Sämann
A. Hammam
Andrei Bursuc
Christoph Stiller
H. Groß
FedML
17
1
0
22 Jul 2022
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
Yi Tay
Mostafa Dehghani
Samira Abnar
Hyung Won Chung
W. Fedus
J. Rao
Sharan Narang
Vinh Q. Tran
Dani Yogatama
Donald Metzler
AI4CE
22
100
0
21 Jul 2022
Towards Better User Studies in Computer Graphics and Vision
Zoya Bylinskii
L. Herman
Aaron Hertzmann
Stefanie Hutka
Yile Zhang
12
13
0
23 Jun 2022
GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
Sebastian Gehrmann
Abhik Bhattacharjee
Abinaya Mahendiran
Alex Jinpeng Wang
Alexandros Papangelis
...
Yacine Jernite
Yi Xu
Yisi Sang
Yixin Liu
Yufang Hou
33
37
0
22 Jun 2022
The Role of Machine Learning in Cybersecurity
Giovanni Apruzzese
P. Laskov
Edgardo Montes de Oca
Wissam Mallouli
Luis Búrdalo Rapa
A. Grammatopoulos
Fabio Di Franco
22
128
0
20 Jun 2022
Please, Don't Forget the Difference and the Confidence Interval when Seeking for the State-of-the-Art Status
Yves Bestgen
17
3
0
23 May 2022
SoK: The Impact of Unlabelled Data in Cyberthreat Detection
Giovanni Apruzzese
P. Laskov
A.T. Tastemirova
25
28
0
18 May 2022
UL2: Unifying Language Learning Paradigms
Yi Tay
Mostafa Dehghani
Vinh Q. Tran
Xavier Garcia
Jason W. Wei
...
Tal Schuster
H. Zheng
Denny Zhou
N. Houlsby
Donald Metzler
AI4CE
33
293
0
10 May 2022
deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks
Dennis Ulmer
Christian Hardmeier
J. Frellsen
29
42
0
14 Apr 2022
Experimental Standards for Deep Learning in Natural Language Processing Research
Dennis Ulmer
Elisa Bassignana
Max Müller-Eberstein
Daniel Varab
Mike Zhang
Rob van der Goot
Christian Hardmeier
Barbara Plank
13
10
0
13 Apr 2022
The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning
Jessica Hullman
Sayash Kapoor
Priyanka Nanayakkara
Andrew Gelman
Arvind Narayanan
11
39
0
12 Mar 2022
Mapping global dynamics of benchmark creation and saturation in artificial intelligence
Simon Ott
A. Barbosa-Silva
Kathrin Blagec
J. Brauner
Matthias Samwald
17
36
0
09 Mar 2022
What are the best systems? New perspectives on NLP Benchmarking
Pierre Colombo
Nathan Noiry
Ekhine Irurozki
Stéphan Clémençon
11
28
0
08 Feb 2022