2111.15366
AI and the Everything in the Whole Wide World Benchmark
26 November 2021
Inioluwa Deborah Raji
Emily M. Bender
Amandalynne Paullada
Emily L. Denton
A. Hanna
Papers citing
"AI and the Everything in the Whole Wide World Benchmark"
40 / 40 papers shown
Title
Towards Application-Specific Evaluation of Vision Models: Case Studies in Ecology and Biology
A. H. H. Chan
Otto Brookes
Urs Waldmann
Hemal Naik
I. Couzin
...
Lukas Boesch
M. Arandjelovic
H. Kühl
T. Burghardt
Fumihiro Kano
104
0
0
05 May 2025
A Platform for Generating Educational Activities to Teach English as a Second Language
Aiala Rosá
Santiago Góngora
Juan Pablo Filevich
Ignacio Sastre
Laura Musto
Brian Carpenter
Luis Chiruzzo
AI4Ed
46
0
0
28 Apr 2025
Toward an Evaluation Science for Generative AI Systems
Laura Weidinger
Deb Raji
Hanna M. Wallach
Margaret Mitchell
Angelina Wang
Olawale Salaudeen
Rishi Bommasani
Sayash Kapoor
Deep Ganguli
Sanmi Koyejo
EGVM
ELM
65
4
0
07 Mar 2025
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation
Maria Eriksson
Erasmo Purificato
Arman Noroozian
Joao Vinagre
Guillaume Chaslot
Emilia Gomez
David Fernandez Llorca
ELM
128
1
0
10 Feb 2025
Towards Effective Discrimination Testing for Generative AI
Thomas P. Zollo
Nikita Rajaneesh
Richard Zemel
Talia B. Gillis
Emily Black
30
1
0
31 Dec 2024
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Vipul Gupta
Candace Ross
David Pantoja
R. Passonneau
Megan Ung
Adina Williams
70
1
0
26 Oct 2024
Exposing Assumptions in AI Benchmarks through Cognitive Modelling
Jonathan H. Rystrøm
Kenneth C. Enevoldsen
32
0
0
25 Sep 2024
CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark
Zachary S. Siegel
Sayash Kapoor
Nitya Nagdir
Benedikt Stroebl
Arvind Narayanan
29
8
0
17 Sep 2024
Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools
Varun Magesh
Faiz Surani
Matthew Dahl
Mirac Suzgun
Christopher D. Manning
Daniel E. Ho
HILM
ELM
AILaw
27
65
0
30 May 2024
People cannot distinguish GPT-4 from a human in a Turing test
Cameron R. Jones
Benjamin K. Bergen
ELM
DeLMO
32
30
0
09 May 2024
Natural Language Processing RELIES on Linguistics
Juri Opitz
Shira Wein
Nathan Schneider
AI4CE
44
7
0
09 May 2024
From Model Performance to Claim: How a Change of Focus in Machine Learning Replicability Can Help Bridge the Responsibility Gap
Tianqi Kou
32
0
0
19 Apr 2024
A Decade's Battle on Dataset Bias: Are We There Yet?
Zhuang Liu
Kaiming He
40
28
0
13 Mar 2024
Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models
Ziqiao Ma
Jacob Sansom
Run Peng
Joyce Chai
45
16
0
30 Oct 2023
A Benchmark for Learning to Translate a New Language from One Grammar Book
Garrett Tanzer
Mirac Suzgun
Chenguang Xi
Dan Jurafsky
Luke Melas-Kyriazi
24
51
0
28 Sep 2023
Position: Key Claims in LLM Research Have a Long Tail of Footnotes
Anna Rogers
A. Luccioni
40
19
0
14 Aug 2023
Rethinking Model Evaluation as Narrowing the Socio-Technical Gap
Q. V. Liao
Ziang Xiao
ALM
ELM
43
29
0
01 Jun 2023
A benchmark for computational analysis of animal behavior, using animal-borne tags
Benjamin Hoffman
M. Cusimano
V. Baglione
D. Canestrari
D. Chevallier
...
O. Vainio
A. Vehkaoja
Ken Yoda
Katie Zacarian
A. Friedlaender
25
7
0
18 May 2023
PaLM 2 Technical Report
Rohan Anil
Andrew M. Dai
Orhan Firat
Melvin Johnson
Dmitry Lepikhin
...
Ce Zheng
Wei Zhou
Denny Zhou
Slav Petrov
Yonghui Wu
ReLM
LRM
62
1,138
0
17 May 2023
Dialogue Games for Benchmarking Language Understanding: Motivation, Taxonomy, Strategy
David Schlangen
ELM
21
12
0
14 Apr 2023
The Capacity for Moral Self-Correction in Large Language Models
Deep Ganguli
Amanda Askell
Nicholas Schiefer
Thomas I. Liao
Kamilė Lukošiūtė
...
Tom B. Brown
C. Olah
Jack Clark
Sam Bowman
Jared Kaplan
LRM
ReLM
31
158
0
15 Feb 2023
Evaluation for Change
Rishi Bommasani
ELM
35
0
0
20 Dec 2022
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop:
Teven Le Scao
Angela Fan
Christopher Akiki
...
Zhongli Xie
Zifan Ye
M. Bras
Younes Belkada
Thomas Wolf
VLM
89
2,301
0
09 Nov 2022
System Safety Engineering for Social and Ethical ML Risks: A Case Study
Edgar W. Jatho
L. Mailloux
Shalaleh Rismani
Eugene D. Williams
Joshua A. Kroll
13
2
0
08 Nov 2022
Underspecification in Scene Description-to-Depiction Tasks
Ben Hutchinson
Jason Baldridge
Vinodkumar Prabhakaran
DiffM
66
32
0
11 Oct 2022
Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements
Leandro von Werra
Lewis Tunstall
A. Thakur
A. Luccioni
Tristan Thrush
...
Julien Chaumond
Margaret Mitchell
Alexander M. Rush
Thomas Wolf
Douwe Kiela
ELM
14
24
0
30 Sep 2022
Making Intelligence: Ethical Values in IQ and ML Benchmarks
Borhane Blili-Hamelin
Leif Hancox-Li
27
16
0
01 Sep 2022
Detecting Environmental Violations with Satellite Imagery in Near Real Time: Land Application under the Clean Water Act
Ben Chugg
Nicolas Rothbacher
A. Feng
Xiaoqi Long
Daniel E. Ho
11
2
0
18 Aug 2022
On the role of benchmarking data sets and simulations in method comparison studies
Sarah Friedrich
T. Friede
25
24
0
02 Aug 2022
Robots Enact Malignant Stereotypes
Andrew Hundt
William Agnew
V. Zeng
Severin Kacianka
Matthew C. Gombolay
LM&Ro
19
41
0
23 Jul 2022
Leakage and the Reproducibility Crisis in ML-based Science
Sayash Kapoor
Arvind Narayanan
25
177
0
14 Jul 2022
Applying data technologies to combat AMR: current status, challenges, and opportunities on the way forward
L. Chindelevitch
E. Jauneikaite
N. Wheeler
K. Allel
Bede Yaw Ansiri-Asafoakaa
...
R. Stocker
L. Unruh
Daniel Waruingi
H. Graz
M. V. Dongen
25
4
0
05 Jul 2022
The Fallacy of AI Functionality
Inioluwa Deborah Raji
Indra Elizabeth Kumar
Aaron Horowitz
Andrew D. Selbst
15
179
0
20 Jun 2022
On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations
Roy Schwartz
Gabriel Stanovsky
22
24
0
27 Apr 2022
FairLex: A Multilingual Benchmark for Evaluating Fairness in Legal Text Processing
Ilias Chalkidis
Tommaso Pasini
Shenmin Zhang
Letizia Tomada
Sebastian Felix Schwemer
Anders Søgaard
AILaw
27
54
0
14 Mar 2022
Language technology practitioners as language managers: arbitrating data bias and predictive bias in ASR
Nina Markl
S. McNulty
17
9
0
25 Feb 2022
Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research
Bernard Koch
Emily L. Denton
A. Hanna
J. Foster
31
139
0
03 Dec 2021
Disembodied Machine Learning: On the Illusion of Objectivity in NLP
Zeerak Talat
Smarika Lulz
Joachim Bingel
Isabelle Augenstein
88
51
0
28 Jan 2021
Pre-trained Models for Natural Language Processing: A Survey
Xipeng Qiu
Tianxiang Sun
Yige Xu
Yunfan Shao
Ning Dai
Xuanjing Huang
LM&MA
VLM
241
1,450
0
18 Mar 2020
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
294
6,950
0
20 Apr 2018