State of What Art? A Call for Multi-Prompt LLM Evaluation

31 December 2023

Gabriel Stanovsky

Papers citing "State of What Art? A Call for Multi-Prompt LLM Evaluation"


Title
A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios Samuel Ackerman Ella Rabinovich E. Farchi Ateret Anaby-Tavor 21 1 0 04 Aug 2024
Improving Minimum Bayes Risk Decoding with Multi-Prompt David Heineman Yao Dou Wei-ping Xu 29 6 0 22 Jul 2024
Questionable practices in machine learning Gavin Leech Juan J. Vazquez Misha Yagudin Niclas Kupper Laurence Aitchison 42 2 0 17 Jul 2024
Social Bias Evaluation for Large Language Models Requires Prompt Variations Rem Hida Masahiro Kaneko Naoaki Okazaki 38 13 0 03 Jul 2024
Paraphrase Types Elicit Prompt Engineering Capabilities Jan Philip Wahle Terry Ruas Yang Xu Bela Gipp 29 5 0 28 Jun 2024
PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation Christoph Leiter Steffen Eger 27 7 0 26 Jun 2024
On the Transformations across Reward Model, Parameter Update, and In-Context Prompt Deng Cai Huayang Li Tingchen Fu Siheng Li Weiwen Xu ... Leyang Cui Yan Wang Lemao Liu Taro Watanabe Shuming Shi KELM 26 2 0 24 Jun 2024
SEAM: A Stochastic Benchmark for Multi-Document Tasks Gili Lior Avi Caciularu Arie Cattan Shahar Levy Ori Shapira Gabriel Stanovsky RALM 33 4 0 23 Jun 2024
An Investigation of Prompt Variations for Zero-shot LLM-based Rankers Shuoqi Sun Shengyao Zhuang Shuai Wang Guido Zuccon 40 5 0 20 Jun 2024
ZeroDL: Zero-shot Distribution Learning for Text Clustering via Large Language Models Hwiyeol Jo Hyunwoo Lee Taiwoo Park 21 0 0 19 Jun 2024
The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance Kyle Moore Jesse Roberts Thao Pham Oseremhen Ewaleifoh Doug Fisher 40 2 0 17 Jun 2024
KGPA: Robustness Evaluation for Large Language Models via Cross-Domain Knowledge Graphs Aihua Pei Zehua Yang Shunan Zhu Ruoxi Cheng Ju Jia Lina Wang 29 1 0 16 Jun 2024
Evaluation and Continual Improvement for an Enterprise AI Assistant Akash Maharaj Kun Qian Uttaran Bhattacharya Sally Fang Horia Galatanu ... Rachel Hanessian Nishant Kapoor Ken Russell Shivakumar Vaithyanathan Yunyao Li 21 4 0 15 Jun 2024
Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework Olivier Binette Jerome P. Reiter 28 0 0 14 Jun 2024
LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages Andrew M. Bean Simi Hellsten Harry Mayne Jabez Magomere Ethan A. Chi Ryan A. Chi Scott A. Hale Hannah Rose Kirk ELM LRM 34 6 0 10 Jun 2024
On the Worst Prompt Performance of Large Language Models Bowen Cao Deng Cai Zhisong Zhang Yuexian Zou Wai Lam ALM LRM 25 5 0 08 Jun 2024
Efficient multi-prompt evaluation of LLMs Felipe Maia Polo Ronald Xu Lucas Weber Mírian Silva Onkar Bhardwaj Leshem Choshen Allysson Flavio Melo de Oliveira Yuekai Sun Mikhail Yurochkin 37 17 0 27 May 2024
A Nurse is Blue and Elephant is Rugby: Cross Domain Alignment in Large Language Models Reveal Human-like Patterns Asaf Yehudai Taelin Karidi Gabriel Stanovsky Ariel Goldstein Omri Abend 33 1 0 23 May 2024
Lessons from the Trenches on Reproducible Evaluation of Language Models Stella Biderman Hailey Schoelkopf Lintang Sutawika Leo Gao J. Tow ... Xiangru Tang Kevin A. Wang Genta Indra Winata François Yvon Andy Zou ELM ALM 125 52 3 23 May 2024
Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs Sylvain Kouemo Ngassom Arghavan Moradi Dakhel Florian Tambon Foutse Khomh 27 2 0 22 May 2024
Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks Melissa Ailem Katerina Marazopoulou Charlotte Siska James Bono 51 13 0 25 Apr 2024
Stronger Random Baselines for In-Context Learning Gregory Yauney David M. Mimno 42 2 0 19 Apr 2024
Claim Check-Worthiness Detection: How Well do LLMs Grasp Annotation Guidelines? Laura Majer Jan Snajder 26 3 0 18 Apr 2024
From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency Xenia Ohmer Elia Bruni Dieuwke Hupkes AI4CE 31 6 0 18 Apr 2024
The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models Giwon Hong Aryo Pradipta Gema Rohit Saxena Xiaotang Du Ping Nie ... Laura Perez-Beltrachini Max Ryabinin Xuanli He Clémentine Fourrier Pasquale Minervini LRM HILM 28 9 0 08 Apr 2024
The Minimum Information about CLinical Artificial Intelligence Checklist for Generative Modeling Research (MI-CLAIM-GEN) Brenda Y. Miao Irene Y. Chen C. Y. Williams Jaysón M. Davidson Augusto Garcia-Agundez ... Bin Yu Milena Gianfrancesco A. Butte Beau Norgeot Madhumita Sushil VLM 34 2 0 05 Mar 2024
LLMs for Targeted Sentiment in News Headlines: Exploring the Descriptive-Prescriptive Dilemma Jana Juros Laura Majer Jan Snajder 31 2 0 01 Mar 2024
Beyond prompt brittleness: Evaluating the reliability and consistency of political worldviews in LLMs Tanise Ceron Neele Falk Ana Barić Dmitry Nikolaev Sebastian Padó 22 15 0 27 Feb 2024
tinyBenchmarks: evaluating LLMs with fewer examples Felipe Maia Polo Lucas Weber Leshem Choshen Yuekai Sun Gongjun Xu Mikhail Yurochkin ELM 24 72 0 22 Feb 2024
The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis Miaoran Zhang Vagrant Gautam Mingyang Wang Jesujoba Oluwadara Alabi Xiaoyu Shen Dietrich Klakow Marius Mosbach 36 8 0 20 Feb 2024
Do Pre-Trained Language Models Detect and Understand Semantic Underspecification? Ask the DUST! Frank Wildenburg Michael Hanna Sandro Pezzelle 23 3 0 19 Feb 2024
Label-Efficient Model Selection for Text Generation Shir Ashury-Tahan Ariel Gera Benjamin Sznajder Leshem Choshen L. Ein-Dor Eyal Shnarch 28 4 0 12 Feb 2024
Homogenization Effects of Large Language Models on Human Creative Ideation Barrett R Anderson Jash Hemant Shah Max Kreminski 34 70 0 02 Feb 2024
When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards Norah A. Alzahrani H. A. Alyahya Sultan Yazeed Alnumay Muhtasim Tahmid Shaykhah Alsubaie ... Saleh Soltan Nathan Scales Marie-Anne Lachaux Samuel R. Bowman Haidar Khan ELM 15 69 0 01 Feb 2024
K-QA: A Real-World Medical Q&A Benchmark Itay Manes Naama Ronn David Cohen Ran Ilan Ber Zehavi Horowitz-Kugler Gabriel Stanovsky LM&MA HILM AI4MH 20 10 0 25 Jan 2024
WARM: On the Benefits of Weight Averaged Reward Models Alexandre Ramé Nino Vieillard Léonard Hussenot Robert Dadashi Geoffrey Cideron Olivier Bachem Johan Ferret 102 92 0 22 Jan 2024
Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements Anton Voronov Lena Wolf Max Ryabinin 19 46 0 12 Jan 2024
Exploring the Reversal Curse and Other Deductive Logical Reasoning in BERT and GPT-Based Large Language Models Da Wu Jing Yang Kai Wang LRM 10 5 0 06 Dec 2023
Prompt Engineering a Prompt Engineer Qinyuan Ye Maxamed Axmed Reid Pryzant Fereshte Khani VLM LLMAG LRM 19 28 0 09 Nov 2023
Competence-Based Analysis of Language Models Adam Davies Jize Jiang Chengxiang Zhai ELM 21 4 0 01 Mar 2023
Instruction Induction: From Few Examples to Natural Language Task Descriptions Or Honovich Uri Shaham Samuel R. Bowman Omer Levy ELM LRM 110 133 0 22 May 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models Jason W. Wei Xuezhi Wang Dale Schuurmans Maarten Bosma Brian Ichter F. Xia Ed H. Chi Quoc Le Denny Zhou LM&Ro LRM AI4CE ReLM 315 8,261 0 28 Jan 2022
Measure and Improve Robustness in NLP Models: A Survey Xuezhi Wang Haohan Wang Diyi Yang 139 130 0 15 Dec 2021
Multitask Prompted Training Enables Zero-Shot Task Generalization Victor Sanh Albert Webson Colin Raffel Stephen H. Bach Lintang Sutawika ... T. Bers Stella Biderman Leo Gao Thomas Wolf Alexander M. Rush LRM 205 1,651 0 15 Oct 2021
The Power of Scale for Parameter-Efficient Prompt Tuning Brian Lester Rami Al-Rfou Noah Constant VPVLM 278 3,784 0 18 Apr 2021