2401.00595
State of What Art? A Call for Multi-Prompt LLM Evaluation
31 December 2023
Moran Mizrahi
Guy Kaplan
Daniel Malkin
Rotem Dror
Dafna Shahaf
Gabriel Stanovsky
ELM
Papers citing "State of What Art? A Call for Multi-Prompt LLM Evaluation"
45 / 95 papers shown
A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios
Samuel Ackerman
Ella Rabinovich
E. Farchi
Ateret Anaby-Tavor
21
1
0
04 Aug 2024
Improving Minimum Bayes Risk Decoding with Multi-Prompt
David Heineman
Yao Dou
Wei-ping Xu
29
6
0
22 Jul 2024
Questionable practices in machine learning
Gavin Leech
Juan J. Vazquez
Misha Yagudin
Niclas Kupper
Laurence Aitchison
42
2
0
17 Jul 2024
Social Bias Evaluation for Large Language Models Requires Prompt Variations
Rem Hida
Masahiro Kaneko
Naoaki Okazaki
38
13
0
03 Jul 2024
Paraphrase Types Elicit Prompt Engineering Capabilities
Jan Philip Wahle
Terry Ruas
Yang Xu
Bela Gipp
29
5
0
28 Jun 2024
PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation
Christoph Leiter
Steffen Eger
27
7
0
26 Jun 2024
On the Transformations across Reward Model, Parameter Update, and In-Context Prompt
Deng Cai
Huayang Li
Tingchen Fu
Siheng Li
Weiwen Xu
...
Leyang Cui
Yan Wang
Lemao Liu
Taro Watanabe
Shuming Shi
KELM
26
2
0
24 Jun 2024
SEAM: A Stochastic Benchmark for Multi-Document Tasks
Gili Lior
Avi Caciularu
Arie Cattan
Shahar Levy
Ori Shapira
Gabriel Stanovsky
RALM
33
4
0
23 Jun 2024
An Investigation of Prompt Variations for Zero-shot LLM-based Rankers
Shuoqi Sun
Shengyao Zhuang
Shuai Wang
Guido Zuccon
40
5
0
20 Jun 2024
ZeroDL: Zero-shot Distribution Learning for Text Clustering via Large Language Models
Hwiyeol Jo
Hyunwoo Lee
Taiwoo Park
21
0
0
19 Jun 2024
The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance
Kyle Moore
Jesse Roberts
Thao Pham
Oseremhen Ewaleifoh
Doug Fisher
40
2
0
17 Jun 2024
KGPA: Robustness Evaluation for Large Language Models via Cross-Domain Knowledge Graphs
Aihua Pei
Zehua Yang
Shunan Zhu
Ruoxi Cheng
Ju Jia
Lina Wang
29
1
0
16 Jun 2024
Evaluation and Continual Improvement for an Enterprise AI Assistant
Akash Maharaj
Kun Qian
Uttaran Bhattacharya
Sally Fang
Horia Galatanu
...
Rachel Hanessian
Nishant Kapoor
Ken Russell
Shivakumar Vaithyanathan
Yunyao Li
21
4
0
15 Jun 2024
Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework
Olivier Binette
Jerome P. Reiter
28
0
0
14 Jun 2024
LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages
Andrew M. Bean
Simi Hellsten
Harry Mayne
Jabez Magomere
Ethan A. Chi
Ryan A. Chi
Scott A. Hale
Hannah Rose Kirk
ELM
LRM
34
6
0
10 Jun 2024
On the Worst Prompt Performance of Large Language Models
Bowen Cao
Deng Cai
Zhisong Zhang
Yuexian Zou
Wai Lam
ALM
LRM
25
5
0
08 Jun 2024
Efficient multi-prompt evaluation of LLMs
Felipe Maia Polo
Ronald Xu
Lucas Weber
Mírian Silva
Onkar Bhardwaj
Leshem Choshen
Allysson Flavio Melo de Oliveira
Yuekai Sun
Mikhail Yurochkin
37
17
0
27 May 2024
A Nurse is Blue and Elephant is Rugby: Cross Domain Alignment in Large Language Models Reveal Human-like Patterns
Asaf Yehudai
Taelin Karidi
Gabriel Stanovsky
Ariel Goldstein
Omri Abend
33
1
0
23 May 2024
Lessons from the Trenches on Reproducible Evaluation of Language Models
Stella Biderman
Hailey Schoelkopf
Lintang Sutawika
Leo Gao
J. Tow
...
Xiangru Tang
Kevin A. Wang
Genta Indra Winata
François Yvon
Andy Zou
ELM
ALM
125
52
3
23 May 2024
Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs
Sylvain Kouemo Ngassom
Arghavan Moradi Dakhel
Florian Tambon
Foutse Khomh
27
2
0
22 May 2024
Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks
Melissa Ailem
Katerina Marazopoulou
Charlotte Siska
James Bono
51
13
0
25 Apr 2024
Stronger Random Baselines for In-Context Learning
Gregory Yauney
David M. Mimno
42
2
0
19 Apr 2024
Claim Check-Worthiness Detection: How Well do LLMs Grasp Annotation Guidelines?
Laura Majer
Jan Snajder
26
3
0
18 Apr 2024
From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency
Xenia Ohmer
Elia Bruni
Dieuwke Hupkes
AI4CE
31
6
0
18 Apr 2024
The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models
Giwon Hong
Aryo Pradipta Gema
Rohit Saxena
Xiaotang Du
Ping Nie
...
Laura Perez-Beltrachini
Max Ryabinin
Xuanli He
Clémentine Fourrier
Pasquale Minervini
LRM
HILM
28
9
0
08 Apr 2024
The Minimum Information about CLinical Artificial Intelligence Checklist for Generative Modeling Research (MI-CLAIM-GEN)
Brenda Y. Miao
Irene Y. Chen
C. Y. Williams
Jaysón M. Davidson
Augusto Garcia-Agundez
...
Bin Yu
Milena Gianfrancesco
A. Butte
Beau Norgeot
Madhumita Sushil
VLM
34
2
0
05 Mar 2024
LLMs for Targeted Sentiment in News Headlines: Exploring the Descriptive-Prescriptive Dilemma
Jana Juros
Laura Majer
Jan Snajder
31
2
0
01 Mar 2024
Beyond prompt brittleness: Evaluating the reliability and consistency of political worldviews in LLMs
Tanise Ceron
Neele Falk
Ana Barić
Dmitry Nikolaev
Sebastian Padó
22
15
0
27 Feb 2024
tinyBenchmarks: evaluating LLMs with fewer examples
Felipe Maia Polo
Lucas Weber
Leshem Choshen
Yuekai Sun
Gongjun Xu
Mikhail Yurochkin
ELM
24
72
0
22 Feb 2024
The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis
Miaoran Zhang
Vagrant Gautam
Mingyang Wang
Jesujoba Oluwadara Alabi
Xiaoyu Shen
Dietrich Klakow
Marius Mosbach
36
8
0
20 Feb 2024
Do Pre-Trained Language Models Detect and Understand Semantic Underspecification? Ask the DUST!
Frank Wildenburg
Michael Hanna
Sandro Pezzelle
23
3
0
19 Feb 2024
Label-Efficient Model Selection for Text Generation
Shir Ashury-Tahan
Ariel Gera
Benjamin Sznajder
Leshem Choshen
L. Ein-Dor
Eyal Shnarch
28
4
0
12 Feb 2024
Homogenization Effects of Large Language Models on Human Creative Ideation
Barrett R Anderson
Jash Hemant Shah
Max Kreminski
34
70
0
02 Feb 2024
When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
Norah A. Alzahrani
H. A. Alyahya
Sultan Yazeed Alnumay
Muhtasim Tahmid
Shaykhah Alsubaie
...
Saleh Soltan
Nathan Scales
Marie-Anne Lachaux
Samuel R. Bowman
Haidar Khan
ELM
15
69
0
01 Feb 2024
K-QA: A Real-World Medical Q&A Benchmark
Itay Manes
Naama Ronn
David Cohen
Ran Ilan Ber
Zehavi Horowitz-Kugler
Gabriel Stanovsky
LM&MA
HILM
AI4MH
20
10
0
25 Jan 2024
WARM: On the Benefits of Weight Averaged Reward Models
Alexandre Ramé
Nino Vieillard
Léonard Hussenot
Robert Dadashi
Geoffrey Cideron
Olivier Bachem
Johan Ferret
102
92
0
22 Jan 2024
Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements
Anton Voronov
Lena Wolf
Max Ryabinin
19
46
0
12 Jan 2024
Exploring the Reversal Curse and Other Deductive Logical Reasoning in BERT and GPT-Based Large Language Models
Da Wu
Jing Yang
Kai Wang
LRM
10
5
0
06 Dec 2023
Prompt Engineering a Prompt Engineer
Qinyuan Ye
Maxamed Axmed
Reid Pryzant
Fereshte Khani
VLM
LLMAG
LRM
19
28
0
09 Nov 2023
Competence-Based Analysis of Language Models
Adam Davies
Jize Jiang
Chengxiang Zhai
ELM
21
4
0
01 Mar 2023
Instruction Induction: From Few Examples to Natural Language Task Descriptions
Or Honovich
Uri Shaham
Samuel R. Bowman
Omer Levy
ELM
LRM
110
133
0
22 May 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
315
8,261
0
28 Jan 2022
Measure and Improve Robustness in NLP Models: A Survey
Xuezhi Wang
Haohan Wang
Diyi Yang
139
130
0
15 Dec 2021
Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh
Albert Webson
Colin Raffel
Stephen H. Bach
Lintang Sutawika
...
T. Bers
Stella Biderman
Leo Gao
Thomas Wolf
Alexander M. Rush
LRM
205
1,651
0
15 Oct 2021
The Power of Scale for Parameter-Efficient Prompt Tuning
Brian Lester
Rami Al-Rfou
Noah Constant
VPVLM
278
3,784
0
18 Apr 2021