Mapping global dynamics of benchmark creation and saturation in artificial intelligence

Nature Communications (Nat Commun), 2022
9 March 2022
Simon Ott
A. Barbosa-Silva
Kathrin Blagec
J. Brauner
Matthias Samwald
arXiv:2203.04592

Papers citing "Mapping global dynamics of benchmark creation and saturation in artificial intelligence"

30 / 30 papers shown
AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent
Yu-Feng Li
L. Li
Qingmin Liao
Fengli Xu
Yong Li
LM&Ro
238
0
0
07 Nov 2025
EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning
Ayesha Gull
Muhammad Usman Safder
Rania Elbadry
Preslav Nakov
Zhuohan Xie
ELM, LRM
308
0
0
03 Nov 2025
Benchmarking is Broken -- Don't Let AI be its Own Judge
Zerui Cheng
Stella Wohnig
Ruchika Gupta
Samiul Alam
Tassallah Abdullahi
...
Daniel Kirste
Aaron Gokaslan
Mikołaj Glinka
Carsten Eickhoff
Ruben Wolff
ELM
222
5
0
08 Oct 2025
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
Minhui Zhu
Minyang Tian
Xiaocheng Yang
Tianci Zhou
Lifan Yuan
...
Ruixing Zhang
X. Wang
Ofir Press
Nicolas Chia
Eliu A. Huerta
LRM, ELM
193
5
0
30 Sep 2025
The Flaw of Averages: Quantifying Uniformity of Performance on Benchmarks
Arda Uzunoglu
Tianjian Li
Daniel Khashabi
209
0
0
30 Sep 2025
Quantum Ensembling Methods for Healthcare and Life Science
Kahn Rhrissorrakrai
Kathleen E. Hamilton
Prerana Bangalore Parthsarathy
Aldo Guzman-Saenz
Tyler Alban
Filippo Utro
Laxmi Parida
316
0
0
02 Jun 2025
Multi-Modal Language Models as Text-to-Image Model Evaluators
Jiahui Chen
Candace Ross
Reyhane Askari Hemmat
Koustuv Sinha
Melissa Hall
M. Drozdzal
Adriana Romero-Soriano
EGVM
485
1
0
01 May 2025
Auditing the Ethical Logic of Generative AI Models
W. Russell Neuman
Chad Coleman
Ali Dasdan
Safinah Ali
Manan Shah
ELM, LRM
350
4
0
24 Apr 2025
Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark
Jasper Götting
Pedro Medeiros
Jon G Sanders
Nathaniel Li
Long Phan
Karam Elabd
Lennart Justen
Dan Hendrycks
Seth Donoughe
ELM
500
23
0
21 Apr 2025
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation
Maria Eriksson
Erasmo Purificato
Arman Noroozian
Joao Vinagre
Guillaume Chaslot
Emilia Gomez
David Fernandez-Llorca
ELM
833
44
0
10 Feb 2025
BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices
Neural Information Processing Systems (NeurIPS), 2024
Anka Reuel
Amelia F. Hardy
Chandler Smith
Max Lamparth
Malcolm Hardy
Mykel J. Kochenderfer
ELM
620
89
0
20 Nov 2024
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Vipul Gupta
Candace Ross
David Pantoja
R. Passonneau
Megan Ung
Adina Williams
797
17
0
26 Oct 2024
Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework
Esteban Garces Arias
Hannah Blocher
Julian Rodemann
Meimingwei Li
Gaojuan Fan
Yi Men
449
5
0
24 Oct 2024
Thematic Analysis with Open-Source Generative AI and Machine Learning: A New Method for Inductive Qualitative Codebook Development
Andrew Katz
Gabriella Coloyan Fleming
Joyce Main
186
13
0
28 Sep 2024
Benchmarks as Microscopes: A Call for Model Metrology
Michael Stephen Saxon
Ari Holtzman
Peter West
William Y. Wang
Naomi Saphra
368
32
0
22 Jul 2024
On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards
Zhimin Zhao
A. A. Bangash
F. Côgo
Bram Adams
Ahmed E. Hassan
815
5
0
04 Jul 2024
RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale
Beck Labash
August Rosedale
Alex Reents
Lucas Negritto
Colin Wiel
KELM
209
11
0
24 Jun 2024
Statistical Multicriteria Benchmarking via the GSD-Front
Christoph Jansen
G. Schollmeyer
Julian Rodemann
Hannah Blocher
Thomas Augustin
447
11
0
06 Jun 2024
Philosophy of Cognitive Science in the Age of Deep Learning
Raphaël Millière
AI4CE, NAI
347
8
0
07 May 2024
A Philosophical Introduction to Language Models - Part II: The Way Forward
Raphael Milliere
Cameron Buckner
LRM
337
25
0
06 May 2024
Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks
International Conference on Machine Learning (ICML), 2024
Guanhua Zhang
Moritz Hardt
355
21
0
02 May 2024
Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress
Christian Schroeder de Witt
Vishaal Udandarao
Juil Sock
Matthias Bethge
Adel Bibi
Samuel Albanie
271
3
0
29 Feb 2024
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
International Conference on Learning Representations (ICLR), 2023
Carlos E. Jimenez
John Yang
Alexander Wettig
Shunyu Yao
Kexin Pei
Ofir Press
Karthik Narasimhan
ELM
545
1,851
0
10 Oct 2023
Operationalising the Definition of General Purpose AI Systems: Assessing Four Approaches
Social Science Research Network (SSRN), 2023
Risto Uuk
C. I. Gutierrez
Alex Tamkin
187
3
0
05 Jun 2023
BenchMD: A Benchmark for Unified Learning on Medical Images and Sensors
Kathryn Wantlin
Chenwei Wu
Shih-Cheng Huang
Oishi Banerjee
Farah Z. Dadabhoy
...
A. Adamson
Laura Heacock
G. Tison
Alex Tamkin
Pranav Rajpurkar
SSL, OOD
204
5
0
17 Apr 2023
Melting Pot 2.0
J. Agapiou
A. Vezhnevets
Edgar A. Duénez-Guzmán
Jayd Matyas
Yiran Mao
...
Sukhdeep Singh
Julia Haas
Igor Mordatch
D. Mobbs
Joel Z Leibo
465
48
0
24 Nov 2022
TAPE: Assessing Few-shot Russian Language Understanding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Ekaterina Taktasheva
Tatiana Shavrina
Alena Fenogenova
Denis Shevelev
Nadezhda Katricheva
...
Svetlana Iordanskaia
Alena Spiridonova
Valentina Kurenshchikova
Ekaterina Artemova
Vladislav Mikhailov
AAML
188
16
0
23 Oct 2022
Vote'n'Rank: Revision of Benchmarking with Social Choice Theory
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2022
Mark Rofin
Vladislav Mikhailov
Mikhail Florinskiy
A. Kravchenko
E. Tutubalina
Tatiana Shavrina
Daniel Karabekyan
Ekaterina Artemova
352
17
0
11 Oct 2022
Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans
Social Science Research Network (SSRN), 2022
John J. Nay
ELM, AILaw
1.2K
35
0
14 Sep 2022
ASR in German: A Detailed Error Analysis
John M. Wirth
René Peinl
165
7
0
12 Apr 2022