Benchmarking is Broken -- Don't Let AI be its Own Judge

8 October 2025
Zerui Cheng
Stella Wohnig
Ruchika Gupta
Samiul Alam
Tassallah Abdullahi
João Alves Ribeiro
Christian Nielsen-Garcia
Saif Mir
Siran Li
Jason Orender
Seyed Ali Bahrainian
Daniel Kirste
Aaron Gokaslan
Mikołaj Glinka
Carsten Eickhoff
Ruben Wolff
Main: 10 pages · 2 figures · 1 table · Bibliography: 2 pages · Appendix: 2 pages
Abstract

The meteoric rise of AI, with its rapidly expanding market capitalization, presents both transformative opportunities and critical challenges. Chief among these is the urgent need for a new, unified paradigm for trustworthy evaluation, as current benchmarks increasingly reveal critical vulnerabilities. Issues like data contamination and selective reporting by model developers fuel hype, while inadequate data quality control can lead to biased evaluations that, even if unintentionally, may favor specific approaches. As a flood of participants enters the AI space, this "Wild West" of assessment makes distinguishing genuine progress from exaggerated claims exceptionally difficult. Such ambiguity blurs scientific signals and erodes public confidence, much as unchecked claims would destabilize financial markets reliant on credible oversight from agencies like Moody's. In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? This position paper argues that the current laissez-faire approach is unsustainable. We contend that true, sustainable AI advancement demands a paradigm shift: a unified, live, and quality-controlled benchmarking framework robust by construction, not by mere courtesy and goodwill. To this end, we dissect the systemic flaws undermining today's AI evaluation, distill the essential requirements for a new generation of assessments, and introduce PeerBench (with its prototype implementation at this https URL), a community-governed, proctored evaluation blueprint that embodies this paradigm through sealed execution, item banking with rolling renewal, and delayed transparency. Our goal is to pave the way for evaluations that can restore integrity and deliver genuinely trustworthy measures of AI progress.
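
The abstract names three mechanisms behind PeerBench: sealed execution, item banking with rolling renewal, and delayed transparency. As a rough illustration only, and not the paper's prototype, the sketch below shows one way the latter two ideas could be modeled as a data structure: items stay sealed while in active use, are retired after a bounded number of evaluation rounds, and are published only once retired. The names (Item, ItemBank, contribute, draw_for_round, published_items) are hypothetical assumptions for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """A single sealed evaluation item (prompt/reference pair)."""
    item_id: str
    prompt: str
    reference_answer: str
    rounds_used: int = 0


class ItemBank:
    """Hypothetical item bank with rolling renewal and delayed transparency."""

    def __init__(self, max_rounds_per_item: int = 3):
        self.max_rounds_per_item = max_rounds_per_item
        self._active: dict[str, Item] = {}
        self._retired: dict[str, Item] = {}

    def contribute(self, item: Item) -> None:
        # Community members keep the pool fresh with unseen items.
        self._active[item.item_id] = item

    def draw_for_round(self) -> list[Item]:
        # Items served in one sealed evaluation round; exposure is tracked
        # so over-used items can be rotated out afterwards.
        batch = list(self._active.values())
        for item in batch:
            item.rounds_used += 1
        self._renew()
        return batch

    def _renew(self) -> None:
        # Rolling renewal: retire items that have been exposed too often,
        # so no item lingers long enough to leak into training data.
        for item_id, item in list(self._active.items()):
            if item.rounds_used >= self.max_rounds_per_item:
                self._retired[item_id] = self._active.pop(item_id)

    def published_items(self) -> list[Item]:
        # Delayed transparency: only retired items are ever made public.
        return list(self._retired.values())
```

The point of the sketch is the lifecycle, not the code itself: items are contributed sealed, drawn for a bounded number of proctored rounds, then retired and disclosed, which limits both contamination and the incentive for selective reporting.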
