AI and the Everything in the Whole Wide World Benchmark

26 November 2021

Inioluwa Deborah Raji

Papers citing "AI and the Everything in the Whole Wide World Benchmark"


Towards Application-Specific Evaluation of Vision Models: Case Studies in Ecology and Biology A. H. H. Chan Otto Brookes Urs Waldmann Hemal Naik I. Couzin ... Lukas Boesch M. Arandjelovic H. Kühl T. Burghardt Fumihiro Kano 104 0 0 05 May 2025
A Platform for Generating Educational Activities to Teach English as a Second Language Aiala Rosá Santiago Góngora Juan Pablo Filevich Ignacio Sastre Laura Musto Brian Carpenter Luis Chiruzzo AI4Ed 46 0 0 28 Apr 2025
Toward an Evaluation Science for Generative AI Systems Laura Weidinger Deb Raji Hanna M. Wallach Margaret Mitchell Angelina Wang Olawale Salaudeen Rishi Bommasani Sayash Kapoor Deep Ganguli Sanmi Koyejo EGVM ELM 65 4 0 07 Mar 2025
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation Maria Eriksson Erasmo Purificato Arman Noroozian Joao Vinagre Guillaume Chaslot Emilia Gomez David Fernandez Llorca ELM 128 1 0 10 Feb 2025
Towards Effective Discrimination Testing for Generative AI Thomas P. Zollo Nikita Rajaneesh Richard Zemel Talia B. Gillis Emily Black 30 1 0 31 Dec 2024
Improving Model Evaluation using SMART Filtering of Benchmark Datasets Vipul Gupta Candace Ross David Pantoja R. Passonneau Megan Ung Adina Williams 70 1 0 26 Oct 2024
Exposing Assumptions in AI Benchmarks through Cognitive Modelling Jonathan H. Rystrøm Kenneth C. Enevoldsen 32 0 0 25 Sep 2024
CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark Zachary S. Siegel Sayash Kapoor Nitya Nagdir Benedikt Stroebl Arvind Narayanan 29 8 0 17 Sep 2024
Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools Varun Magesh Faiz Surani Matthew Dahl Mirac Suzgun Christopher D. Manning Daniel E. Ho HILM ELM AILaw 27 65 0 30 May 2024
People cannot distinguish GPT-4 from a human in a Turing test Cameron R. Jones Benjamin K. Bergen ELM DeLMO 32 30 0 09 May 2024
Natural Language Processing RELIES on Linguistics Juri Opitz Shira Wein Nathan Schneider AI4CE 44 7 0 09 May 2024
From Model Performance to Claim: How a Change of Focus in Machine Learning Replicability Can Help Bridge the Responsibility Gap Tianqi Kou 32 0 0 19 Apr 2024
A Decade's Battle on Dataset Bias: Are We There Yet? Zhuang Liu Kaiming He 40 28 0 13 Mar 2024
Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models Ziqiao Ma Jacob Sansom Run Peng Joyce Chai 45 16 0 30 Oct 2023
A Benchmark for Learning to Translate a New Language from One Grammar Book Garrett Tanzer Mirac Suzgun Chenguang Xi Dan Jurafsky Luke Melas-Kyriazi 24 51 0 28 Sep 2023
Position: Key Claims in LLM Research Have a Long Tail of Footnotes Anna Rogers A. Luccioni 40 19 0 14 Aug 2023
Rethinking Model Evaluation as Narrowing the Socio-Technical Gap Q. V. Liao Ziang Xiao ALM ELM 43 29 0 01 Jun 2023
A benchmark for computational analysis of animal behavior, using animal-borne tags Benjamin Hoffman M. Cusimano V. Baglione D. Canestrari D. Chevallier ... O. Vainio A. Vehkaoja Ken Yoda Katie Zacarian A. Friedlaender 25 7 0 18 May 2023
PaLM 2 Technical Report Rohan Anil Andrew M. Dai Orhan Firat Melvin Johnson Dmitry Lepikhin ... Ce Zheng Wei Zhou Denny Zhou Slav Petrov Yonghui Wu ReLM LRM 62 1,138 0 17 May 2023
Dialogue Games for Benchmarking Language Understanding: Motivation, Taxonomy, Strategy David Schlangen ELM 21 12 0 14 Apr 2023
The Capacity for Moral Self-Correction in Large Language Models Deep Ganguli Amanda Askell Nicholas Schiefer Thomas I. Liao Kamilė Lukošiūtė ... Tom B. Brown C. Olah Jack Clark Sam Bowman Jared Kaplan LRM ReLM 31 158 0 15 Feb 2023
Evaluation for Change Rishi Bommasani ELM 35 0 0 20 Dec 2022
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model BigScience Workshop: Teven Le Scao Angela Fan Christopher Akiki ... Zhongli Xie Zifan Ye M. Bras Younes Belkada Thomas Wolf VLM 89 2,301 0 09 Nov 2022
System Safety Engineering for Social and Ethical ML Risks: A Case Study Edgar W. Jatho L. Mailloux Shalaleh Rismani Eugene D. Williams Joshua A. Kroll 13 2 0 08 Nov 2022
Underspecification in Scene Description-to-Depiction Tasks Ben Hutchinson Jason Baldridge Vinodkumar Prabhakaran DiffM 66 32 0 11 Oct 2022
Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements Leandro von Werra Lewis Tunstall A. Thakur A. Luccioni Tristan Thrush ... Julien Chaumond Margaret Mitchell Alexander M. Rush Thomas Wolf Douwe Kiela ELM 14 24 0 30 Sep 2022
Making Intelligence: Ethical Values in IQ and ML Benchmarks Borhane Blili-Hamelin Leif Hancox-Li 27 16 0 01 Sep 2022
Detecting Environmental Violations with Satellite Imagery in Near Real Time: Land Application under the Clean Water Act Ben Chugg Nicolas Rothbacher A. Feng Xiaoqi Long Daniel E. Ho 11 2 0 18 Aug 2022
On the role of benchmarking data sets and simulations in method comparison studies Sarah Friedrich T. Friede 25 24 0 02 Aug 2022
Robots Enact Malignant Stereotypes Andrew Hundt William Agnew V. Zeng Severin Kacianka Matthew C. Gombolay LM&Ro 19 41 0 23 Jul 2022
Leakage and the Reproducibility Crisis in ML-based Science Sayash Kapoor Arvind Narayanan 25 177 0 14 Jul 2022
Applying data technologies to combat AMR: current status, challenges, and opportunities on the way forward L. Chindelevitch E. Jauneikaite N. Wheeler K. Allel Bede Yaw Ansiri-Asafoakaa ... R. Stocker L. Unruh Daniel Waruingi H. Graz M. V. Dongen 25 4 0 05 Jul 2022
The Fallacy of AI Functionality Inioluwa Deborah Raji Indra Elizabeth Kumar Aaron Horowitz Andrew D. Selbst 15 179 0 20 Jun 2022
On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations Roy Schwartz Gabriel Stanovsky 22 24 0 27 Apr 2022
FairLex: A Multilingual Benchmark for Evaluating Fairness in Legal Text Processing Ilias Chalkidis Tommaso Pasini Shenmin Zhang Letizia Tomada Sebastian Felix Schwemer Anders Søgaard AILaw 27 54 0 14 Mar 2022
Language technology practitioners as language managers: arbitrating data bias and predictive bias in ASR Nina Markl S. McNulty 17 9 0 25 Feb 2022
Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research Bernard Koch Emily L. Denton A. Hanna J. Foster 31 139 0 03 Dec 2021
Disembodied Machine Learning: On the Illusion of Objectivity in NLP Zeerak Talat Smarika Lulz Joachim Bingel Isabelle Augenstein 88 51 0 28 Jan 2021
Pre-trained Models for Natural Language Processing: A Survey Xipeng Qiu Tianxiang Sun Yige Xu Yunfan Shao Ning Dai Xuanjing Huang LM&MA VLM 241 1,450 0 18 Mar 2020
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding Alex Wang Amanpreet Singh Julian Michael Felix Hill Omer Levy Samuel R. Bowman ELM 294 6,950 0 20 Apr 2018