2401.00595
State of What Art? A Call for Multi-Prompt LLM Evaluation
31 December 2023
Moran Mizrahi
Guy Kaplan
Daniel Malkin
Rotem Dror
Dafna Shahaf
Gabriel Stanovsky
ELM
Papers citing
"State of What Art? A Call for Multi-Prompt LLM Evaluation"
50 / 95 papers shown
Title
Cooking Up Creativity: A Cognitively-Inspired Approach for Enhancing LLM Creativity through Structured Representations
Moran Mizrahi
Chen Shani
Gabriel Stanovsky
Dan Jurafsky
Dafna Shahaf
29
0
0
29 Apr 2025
How Effective are Generative Large Language Models in Performing Requirements Classification?
Waad Alhoshan
Alessio Ferrari
Liping Zhao
20
0
0
23 Apr 2025
What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
Michael A. Hedderich
Anyi Wang
Raoyuan Zhao
Florian Eichin
Barbara Plank
30
0
0
22 Apr 2025
MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks
Jaime Raldua Veuthey
Zainab Ali Majid
Suhas Hariharan
Jacob Haimes
ELM
26
0
0
18 Apr 2025
MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages
Dieuwke Hupkes
Nikolay Bogoychev
48
0
0
14 Apr 2025
Synthetic Fluency: Hallucinations, Confabulations, and the Creation of Irish Words in LLM-Generated Translations
Sheila Castilho
Zoe Fitzsimmons
Claire Holton
Aoife Mc Donagh
26
0
0
10 Apr 2025
Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability
Jennifer Haase
P. Hanel
Sebastian Pokutta
ALM
LRM
60
0
0
10 Apr 2025
Towards LLMs Robustness to Changes in Prompt Format Styles
Lilian Ngweta
Kiran Kate
Jason Tsay
Yara Rizk
AAML
VLM
27
0
0
09 Apr 2025
SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models
Justus Westerhoff
Erblina Purellku
Jakob Hackstein
Jonas Loos
Leo Pinetzki
Lorenz Hufe
AAML
28
0
0
07 Apr 2025
Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study
Aryan Agrawal
Lisa Alazraki
Shahin Honarvar
Marek Rei
49
0
0
03 Apr 2025
Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions
Yubo Li
Yidi Miao
Xueying Ding
Ramayya Krishnan
R. Padman
37
0
0
28 Mar 2025
ConSCompF: Consistency-focused Similarity Comparison Framework for Generative Large Language Models
Alexey Karev
Dong Xu
43
0
0
18 Mar 2025
Aligned Probing: Relating Toxic Behavior and Model Internals
Andreas Waldis
Vagrant Gautam
Anne Lauscher
Dietrich Klakow
Iryna Gurevych
33
0
0
17 Mar 2025
Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models
Junjie Chen
X. Liu
Subin Huang
Linfeng Zhang
Hang Yu
58
0
0
15 Mar 2025
Evaluating the Process Modeling Abilities of Large Language Models -- Preliminary Foundations and Results
Peter Fettke
Constantin Houy
ELM
35
0
0
14 Mar 2025
DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation
Eliya Habba
Ofir Arviv
Itay Itzhak
Yotam Perlitz
Elron Bandel
Leshem Choshen
Michal Shmueli-Scheuer
Gabriel Stanovsky
64
1
0
03 Mar 2025
Answer, Refuse, or Guess? Investigating Risk-Aware Decision Making in Language Models
Cheng-Kuang Wu
Zhi Rui Tam
Chieh-Yen Lin
Yun-Nung Chen
Hung-yi Lee
62
0
0
03 Mar 2025
Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness
Tingchen Fu
Fazl Barez
AAML
58
0
0
03 Mar 2025
ECLeKTic: a Novel Challenge Set for Evaluation of Cross-Lingual Knowledge Transfer
Omer Goldman
Uri Shaham
Dan Malkin
Sivan Eiger
Avinatan Hassidim
...
Shruti Rijhwani
Laura Rimell
Idan Szpektor
Reut Tsarfaty
Matan Eyal
42
3
0
28 Feb 2025
SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models
Grigor Nalbandyan
Rima Shahbazyan
Evelina Bakhturina
ELM
33
0
0
28 Feb 2025
Human Preferences in Large Language Model Latent Space: A Technical Analysis on the Reliability of Synthetic Data in Voting Outcome Prediction
Sarah Ball
Simeon Allmendinger
Frauke Kreuter
Niklas Kühl
44
0
0
22 Feb 2025
From Selection to Generation: A Survey of LLM-based Active Learning
Yu Xia
Subhojyoti Mukherjee
Zhouhang Xie
Junda Wu
Xintong Li
...
Namyong Park
T. Nguyen
Jiebo Luo
Ryan Rossi
Julian McAuley
53
0
0
17 Feb 2025
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation
Maria Eriksson
Erasmo Purificato
Arman Noroozian
Joao Vinagre
Guillaume Chaslot
Emilia Gomez
David Fernandez Llorca
ELM
122
1
0
10 Feb 2025
Evalita-LLM: Benchmarking Large Language Models on Italian
Bernardo Magnini
Roberto Zanoli
Michele Resta
Martin Cimmino
Paolo Albano
Marco Madeddu
V. Patti
53
1
0
04 Feb 2025
MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models
Zhongpu Chen
Y. Liu
Long Shi
Zhi-Jie Wang
Xingyan Chen
Yu Zhao
Fuji Ren
41
0
0
28 Jan 2025
LCTG Bench: LLM Controlled Text Generation Benchmark
K. K.
Masato Mita
Peinan Zhang
S. Sasaki
Ryosuke Ishigami
Naoaki Okazaki
55
0
0
28 Jan 2025
Personalizing Education through an Adaptive LMS with Integrated LLMs
Kyle Spriggs
Meng Cheng Lau
Kalpdrum Passi
AI4Ed
48
0
0
24 Jan 2025
JuStRank: Benchmarking LLM Judges for System Ranking
Ariel Gera
Odellia Boni
Yotam Perlitz
Roy Bar-Haim
Lilach Eden
Asaf Yehudai
ALM
ELM
92
3
0
12 Dec 2024
ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation?
Leixin Zhang
Steffen Eger
Yinjie Cheng
Weihe Zhai
Jonas Belouadi
Christoph Leiter
Simone Paolo Ponzetto
Fahimeh Moafian
Zhixue Zhao
MLLM
72
1
0
03 Dec 2024
SelfPrompt: Autonomously Evaluating LLM Robustness via Domain-Constrained Knowledge Guidelines and Refined Adversarial Prompts
Aihua Pei
Zehua Yang
Shunan Zhu
Ruoxi Cheng
Ju Jia
AAML
61
2
0
01 Dec 2024
Explaining GPT-4's Schema of Depression Using Machine Behavior Analysis
Adithya V Ganesan
Vasudha Varadarajan
Yash Kumar Lal
Veerle C. Eijsbroek
Katarina Kjell
...
Elizabeth C. Stade
J. Eichstaedt
Ryan L. Boyd
H. A. Schwartz
Lucie Flek
AI4MH
59
0
0
21 Nov 2024
The Interaction Layer: An Exploration for Co-Designing User-LLM Interactions in Parental Wellbeing Support Systems
Sruthi Viswanathan
Seray Ibrahim
Ravi Shankar
Reuben Binns
Max Van Kleek
Petr Slovák
55
1
0
02 Nov 2024
Knowledge Distillation Using Frontier Open-source LLMs: Generalizability and the Role of Synthetic Data
Anup Shirgaonkar
Nikhil Pandey
Nazmiye Ceren Abay
Tolga Aktas
Vijay Aski
ALM
SyDa
24
0
0
24 Oct 2024
How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs
Ran Zhang
Wei-Ye Zhao
Steffen Eger
65
4
0
24 Oct 2024
BIG5-CHAT: Shaping LLM Personalities Through Training on Human-Grounded Data
Wenkai Li
Jiarui Liu
Andy Liu
Xuhui Zhou
Mona Diab
Maarten Sap
44
5
0
21 Oct 2024
LoRA Soups: Merging LoRAs for Practical Skill Composition Tasks
Akshara Prabhakar
Yuanzhi Li
Karthik Narasimhan
Sham Kakade
Eran Malach
Samy Jelassi
MoMe
21
1
0
16 Oct 2024
ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
Jingming Zhuo
S. Zhang
Xinyu Fang
Haodong Duan
Dahua Lin
Kai Chen
15
17
0
16 Oct 2024
Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers
Lorenzo Pacchiardi
Marko Tesic
Lucy G. Cheke
José Hernández Orallo
31
3
0
15 Oct 2024
Eliciting Textual Descriptions from Representations of Continuous Prompts
Dana Ramati
Daniela Gottesman
Mor Geva
24
0
0
15 Oct 2024
A Cross-Lingual Statutory Article Retrieval Dataset for Taiwan Legal Studies
Yen-Hsiang Wang
Feng-Dian Su
Tzu-Yu Yeh
Yao-Chung Fan
RALM
AILaw
11
0
0
15 Oct 2024
Skill Learning Using Process Mining for Large Language Model Plan Generation
Andrei Cosmin Redis
M. Sani
Bahram Zarrin
Andrea Burattin
13
0
0
14 Oct 2024
LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints
Thomas Palmeira Ferraz
Kartik Mehta
Yu-Hsiang Lin
Haw-Shiuan Chang
Shereen Oraby
Sijia Liu
Vivek Subramanian
Tagyoung Chung
Mohit Bansal
Nanyun Peng
48
7
0
09 Oct 2024
POSIX: A Prompt Sensitivity Index For Large Language Models
Anwoy Chatterjee
H. S. V. N. S. K. Renduchintala
S. Bhatia
Tanmoy Chakraborty
AAML
11
6
0
03 Oct 2024
A Survey on the Honesty of Large Language Models
Siheng Li
Cheng Yang
Taiqiang Wu
Chufan Shi
Yuji Zhang
...
Jie Zhou
Yujiu Yang
Ngai Wong
Xixin Wu
Wai Lam
HILM
27
4
0
27 Sep 2024
The Lou Dataset -- Exploring the Impact of Gender-Fair Language in German Text Classification
Andreas Waldis
Joel Birrer
Anne Lauscher
Iryna Gurevych
23
1
0
26 Sep 2024
SSE: Multimodal Semantic Data Selection and Enrichment for Industrial-scale Data Assimilation
Maying Shen
Nadine Chang
Sifei Liu
Jose M. Alvarez
26
0
0
20 Sep 2024
Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination
Eva Sánchez Salido
Roser Morante
Julio Gonzalo
Guillermo Marco
Jorge Carrillo-de-Albornoz
...
Enrique Amigó
Andrés Fernández
Alejandro Benito-Santos
Adrián Ghajari Espinosa
Victor Fresno
ELM
39
0
0
19 Sep 2024
Revolutionizing Database Q&A with Large Language Models: Comprehensive Benchmark and Evaluation
Yihang Zheng
Bo-wen Li
Zhenghao Lin
Yi Luo
Xuanhe Zhou
Chen Lin
Jinsong Su
Guoliang Li
Shifu Li
ELM
41
1
0
05 Sep 2024
Beneath the Surface of Consistency: Exploring Cross-lingual Knowledge Representation Sharing in LLMs
Maxim Ifergan
Leshem Choshen
Roee Aharoni
Idan Szpektor
Omri Abend
HILM
35
3
0
20 Aug 2024
Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models
Zhi Rui Tam
Cheng-Kuang Wu
Yi-Lin Tsai
Chieh-Yen Lin
Hung-yi Lee
Yun-Nung Chen
22
24
0
05 Aug 2024