Anchor Points: Benchmarking Models with Much Fewer Examples

14 September 2023

Douwe Kiela

Papers citing "Anchor Points: Benchmarking Models with Much Fewer Examples"

Title
Efficient Evaluation of Large Language Models via Collaborative Filtering Xu-Xiang Zhong Chao Yi Han-Jia Ye 24 0 0 05 Apr 2025
Reliable and Efficient Amortized Model-based Evaluation Sang T. Truong Yuheng Tu Percy Liang Bo-wen Li Sanmi Koyejo ELM 59 1 0 17 Mar 2025
BenTo: Benchmark Task Reduction with In-Context Transferability Hongyu Zhao Ming Li Lichao Sun Tianyi Zhou 28 0 0 17 Oct 2024
Active Evaluation Acquisition for Efficient LLM Benchmarking Yang Li Jie Ma Miguel Ballesteros Yassine Benajiba Graham Horwood ELM 14 1 0 08 Oct 2024
Instruction Embedding: Latent Representations of Instructions Towards Task Identification Yiwei Li Jiayi Shi Shaoxiong Feng Peiwen Yuan Xinglin Wang Boyuan Pan Heda Wang Yao Hu Kan Li 23 2 0 29 Sep 2024
100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances Lorenzo Pacchiardi Lucy G. Cheke José Hernández Orallo ALM LRM ELM 36 3 0 05 Sep 2024
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models Kaichen Zhang Bo Li Peiyuan Zhang Fanyi Pu Joshua Adrian Cahyono ... Shuai Liu Yuanhan Zhang Jingkang Yang Chunyuan Li Ziwei Liu 85 74 0 17 Jul 2024
Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling Cong Xu Gayathri Saranathan Mahammad Parwez Alam Arpit Shah James Lim Soon Yee Wong Foltin Martin Suparna Bhattacharya VLM 35 3 0 21 Jun 2024
Quantifying Variance in Evaluation Benchmarks Lovish Madaan Aaditya K. Singh Rylan Schaeffer Andrew Poulton Sanmi Koyejo Pontus Stenetorp Sharan Narang Dieuwke Hupkes 33 9 0 14 Jun 2024
Efficient multi-prompt evaluation of LLMs Felipe Maia Polo Ronald Xu Lucas Weber Mírian Silva Onkar Bhardwaj Leshem Choshen Allysson Flavio Melo de Oliveira Yuekai Sun Mikhail Yurochkin 37 17 0 27 May 2024
Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks Melissa Ailem Katerina Marazopoulou Charlotte Siska James Bono 51 13 0 25 Apr 2024
FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models Lin Zhao Tianchen Zhao Zinan Lin Xuefei Ning Guohao Dai Huazhong Yang Yu Wang EGVM 42 7 0 25 Mar 2024
Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress Ameya Prabhu Vishaal Udandarao Philip H. S. Torr Matthias Bethge Adel Bibi Samuel Albanie 23 5 0 29 Feb 2024
tinyBenchmarks: evaluating LLMs with fewer examples Felipe Maia Polo Lucas Weber Leshem Choshen Yuekai Sun Gongjun Xu Mikhail Yurochkin ELM 24 72 0 22 Feb 2024
Label-Efficient Model Selection for Text Generation Shir Ashury-Tahan Ariel Gera Benjamin Sznajder Leshem Choshen L. Ein-Dor Eyal Shnarch 31 4 0 12 Feb 2024
FinanceBench: A New Benchmark for Financial Question Answering Pranab Islam Anand Kannappan Douwe Kiela Rebecca Qian Nino Scherrer Bertie Vidgen RALM 19 71 0 20 Nov 2023
Post Turing: Mapping the landscape of LLM Evaluation Alexey Tikhonov Ivan P. Yamshchikov ELM 33 4 0 03 Nov 2023
How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench Qinyuan Ye Harvey Yiyun Fu Xiang Ren Robin Jia ELM 19 21 0 24 May 2023
Simfluence: Modeling the Influence of Individual Training Examples by Simulating Training Runs Kelvin Guu Albert Webson Ellie Pavlick Lucas Dixon Ian Tenney Tolga Bolukbasi TDI 66 33 0 14 Mar 2023
PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts Stephen H. Bach Victor Sanh Zheng-Xin Yong Albert Webson Colin Raffel ... Khalid Almubarak Xiangru Tang Dragomir R. Radev Mike Tian-Jian Jiang Alexander M. Rush VLM 225 338 0 02 Feb 2022
Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information Kawin Ethayarajh Yejin Choi Swabha Swayamdipta 157 157 0 16 Oct 2021
GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training Krishnateja Killamsetty D. Sivasubramanian Ganesh Ramakrishnan A. De Rishabh K. Iyer OOD 83 188 0 27 Feb 2021
With Little Power Comes Great Responsibility Dallas Card Peter Henderson Urvashi Khandelwal Robin Jia Kyle Mahowald Dan Jurafsky 225 115 0 13 Oct 2020
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding Alex Wang Amanpreet Singh Julian Michael Felix Hill Omer Levy Samuel R. Bowman ELM 294 6,943 0 20 Apr 2018
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles Balaji Lakshminarayanan Alexander Pritzel Charles Blundell UQCV BDL 268 5,652 0 05 Dec 2016