
AutoEval Done Right: Using Synthetic Data for Model Evaluation

9 March 2024

Pierre Boyeau

Anastasios Nikolas Angelopoulos

Papers citing "AutoEval Done Right: Using Synthetic Data for Model Evaluation"

25 / 25 papers shown

Title
How to Correctly Report LLM-as-a-Judge Evaluations Chungpa Lee Thomas Zeng Jongwon Jeong Jy-yong Sohn Kangwook Lee 149 1 0 26 Nov 2025
Extending Prediction-Powered Inference through Conformal Prediction Daniel Csillag Pedro Dall'Antonia C. Struchiner G. Goedert 121 0 0 17 Oct 2025
Statistical Inference Leveraging Synthetic Data with Distribution-Free Guarantees Meshi Bashari Yonghoon Lee Roy Maor Lotan Edgar Dobriban Yaniv Romano SyDa 144 1 0 24 Sep 2025
Statistical Methods in Generative AI Edgar Dobriban 257 3 0 08 Sep 2025
Towards a rigorous evaluation of RAG systems: the challenge of due diligence Grégoire Martinon Alexandra Lorenzo de Brionne Jérôme Bohard Antoine Lojou Damien Hervault Nicolas Brunel 152 1 0 29 Jul 2025
Sim2Val: Leveraging Correlation Across Test Platforms for Variance-Reduced Metric Estimation Rachel Luo Heng Yang Michael Watson Apoorva Sharma Sushant Veer Edward Schmerling Marco Pavone 51 2 0 25 Jun 2025
Cost-Optimal Active AI Model Evaluation Anastasios Nikolas Angelopoulos Jacob Eisenstein Jonathan Berant Alekh Agarwal Adam Fisch 153 2 0 09 Jun 2025
Data Swarms: Optimizable Generation of Synthetic Evaluation Data Shangbin Feng Yike Wang Weijia Shi Yulia Tsvetkov 303 0 0 31 May 2025
GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning. Annual Meeting of the Association for Computational Linguistics (ACL), 2025 Qingchen Yu Zifan Zheng Ding Chen Simin Niu Bo Tang Feiyu Xiong Zhiyu Li ELM LRM 145 3 0 28 May 2025
No Free Lunch: Non-Asymptotic Analysis of Prediction-Powered Inference P. Mani Peng Xu Zachary Chase Lipton Michael Oberst 244 2 0 26 May 2025
Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees Sangwoo Park Matteo Zecchin Osvaldo Simeone 154 2 0 24 May 2025
Cer-Eval: Certifiable and Cost-Efficient Evaluation Framework for LLMs G. Wang Zhiwen Chen Bo Li Haifeng Xu 849 2 0 02 May 2025
Validating LLM-as-a-Judge Systems under Rating Indeterminacy Luke M. Guerdan Solon Barocas Kenneth Holstein Hanna M. Wallach Zhiwei Steven Wu Alexandra Chouldechova ALM ELM 1.1K 3 0 07 Mar 2025
Accelerating Unbiased LLM Evaluation via Synthetic Feedback Zhaoyi Zhou Yuda Song Andrea Zanette ALM 300 3 0 14 Feb 2025
Evaluation of Large Language Models via Coupled Token Generation N. C. Benz Stratis Tsirtsis Eleni Straitouri Ivi Chatzi Ander Artola Velasco Suhas Thejaswi Manuel Gomez Rodriguez 299 3 0 03 Feb 2025
Regression for the Mean: Auto-Evaluation and Inference with Few Labels through Post-hoc Regression Benjamin Eyre David Madras 325 5 0 19 Nov 2024
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data. International Conference on Learning Representations (ICLR), 2024 Florian E. Dorner Vivian Y. Nastl Moritz Hardt ELM ALM 340 20 0 17 Oct 2024
Language Model Preference Evaluation with Multiple Weak Evaluators Zhengyu Hu Jieyu Zhang Zhihan Xiong Alexander Ratner Hui Xiong Ranjay Krishna 339 10 0 14 Oct 2024
ChainBuddy: An AI Agent System for Generating LLM Pipelines. International Conference on Human Factors in Computing Systems (CHI), 2024 Jingyue Zhang Ian Arawjo LLMAG 171 0 0 20 Sep 2024
Can Unconfident LLM Annotations Be Used for Confident Conclusions? North American Chapter of the Association for Computational Linguistics (NAACL), 2024 Kristina Gligorić Tijana Zrnic Cinoo Lee Emmanuel J. Candès Dan Jurafsky 344 25 0 27 Aug 2024
AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews Keith Tyser Ben Segev Gaston Longhitano Xin-Yu Zhang Zachary Meeks ... Nicholas Belsten A. Shporer Madeleine Udell Dov Te’eni Iddo Drori 130 42 0 19 Aug 2024
Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation Adam Fisch Joshua Maynez R. A. Hofer Bhuwan Dhingra Amir Globerson William W. Cohen 206 16 0 06 Jun 2024
A Note on the Prediction-Powered Bootstrap Tijana Zrnic 281 6 0 28 May 2024
Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences Shreya Shankar J.D. Zamfirescu-Pereira Bjorn Hartmann Aditya G. Parameswaran Ian Arawjo ALM 177 174 0 18 Apr 2024
Prediction-Powered Ranking of Large Language Models Ivi Chatzi Eleni Straitouri Suhas Thejaswi Manuel Gomez Rodriguez ALM 351 13 0 27 Feb 2024