arXiv:2203.04592

Mapping global dynamics of benchmark creation and saturation in artificial intelligence

Nature Communications (Nat Commun), 2022

9 March 2022

A. Barbosa-Silva

Matthias Samwald

Papers citing "Mapping global dynamics of benchmark creation and saturation in artificial intelligence"

30 / 30 papers shown

AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent

Yong Li

238

0

0

07 Nov 2025

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

Muhammad Usman Safder

Preslav Nakov

Zhuohan Xie

308

0

0

03 Nov 2025

Benchmarking is Broken -- Don't Let AI be its Own Judge

Tassallah Abdullahi

...

Mikołaj Glinka

Carsten Eickhoff

222

5

0

08 Oct 2025

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

...

193

5

0

30 Sep 2025

The Flaw of Averages: Quantifying Uniformity of Performance on Benchmarks

Daniel Khashabi

209

0

0

30 Sep 2025

Quantum Ensembling Methods for Healthcare and Life Science

Kahn Rhrissorrakrai

Kathleen E. Hamilton

Prerana Bangalore Parthsarathy

Aldo Guzman-Saenz

316

0

0

02 Jun 2025

Multi-Modal Language Models as Text-to-Image Model Evaluators

Reyhane Askari Hemmat

Adriana Romero-Soriano

485

1

0

01 May 2025

Auditing the Ethical Logic of Generative AI Models

W. Russell Neuman

350

4

0

24 Apr 2025

Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark

Jasper Götting

500

23

0

21 Apr 2025

Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation

Erasmo Purificato

Arman Noroozian

Guillaume Chaslot

David Fernandez-Llorca

833

44

0

10 Feb 2025

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices

Neural Information Processing Systems (NeurIPS), 2024

Amelia F. Hardy

Mykel J. Kochenderfer

620

89

0

20 Nov 2024

Improving Model Evaluation using SMART Filtering of Benchmark Datasets

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

797

17

0

26 Oct 2024

Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework

Esteban Garces Arias

Julian Rodemann

449

5

0

24 Oct 2024

Thematic Analysis with Open-Source Generative AI and Machine Learning: A New Method for Inductive Qualitative Codebook Development

Gabriella Coloyan Fleming

186

13

0

28 Sep 2024

Benchmarks as Microscopes: A Call for Model Metrology

Michael Stephen Saxon

William Y. Wang

Naomi Saphra

368

32

0

22 Jul 2024

On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards

Ahmed E. Hassan

815

5

0

04 Jul 2024

RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale

August Rosedale

209

11

0

24 Jun 2024

Statistical Multicriteria Benchmarking via the GSD-Front

Christoph Jansen

Julian Rodemann

Thomas Augustin

447

11

0

06 Jun 2024

Philosophy of Cognitive Science in the Age of Deep Learning

Raphaël Millière

347

8

0

07 May 2024

A Philosophical Introduction to Language Models - Part II: The Way Forward

Raphaël Millière

Cameron Buckner

337

25

0

06 May 2024

Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks

International Conference on Machine Learning (ICML), 2024

355

21

0

02 May 2024

Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress

Christian Schroeder de Witt

Vishaal Udandarao

Matthias Bethge

271

3

0

29 Feb 2024

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

International Conference on Learning Representations (ICLR), 2023

Carlos E. Jimenez

Alexander Wettig

Ofir Press

Karthik Narasimhan

545

1,851

0

10 Oct 2023

Operationalising the Definition of General Purpose AI Systems: Assessing Four Approaches

Social Science Research Network (SSRN), 2023

C. I. Gutierrez

187

3

0

05 Jun 2023

BenchMD: A Benchmark for Unified Learning on Medical Images and Sensors

Kathryn Wantlin

Shih-Cheng Huang

Farah Z. Dadabhoy

...

Pranav Rajpurkar

204

5

0

17 Apr 2023

Melting Pot 2.0

Edgar A. Duéñez-Guzmán

...

465

48

0

24 Nov 2022

TAPE: Assessing Few-shot Russian Language Understanding

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

Ekaterina Taktasheva

Tatiana Shavrina

Alena Fenogenova

Nadezhda Katricheva

...

Svetlana Iordanskaia

Alena Spiridonova

Valentina Kurenshchikova

Ekaterina Artemova

Vladislav Mikhailov

188

16

0

23 Oct 2022

Vote'n'Rank: Revision of Benchmarking with Social Choice Theory

Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2022

Vladislav Mikhailov

Mikhail Florinskiy

Tatiana Shavrina

Daniel Karabekyan

Ekaterina Artemova

352

17

0

11 Oct 2022

Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans

Social Science Research Network (SSRN), 2022

1.2K

35

0

14 Sep 2022

ASR in German: A Detailed Error Analysis

165

7

0

12 Apr 2022
