What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering

arXiv:2406.12334 · 18 June 2024
Federico Errica, G. Siracusano, D. Sanvito, Roberto Bifulco

Papers citing "What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering"

16 papers shown:

Developing A Framework to Support Human Evaluation of Bias in Generated Free Response Text
Jennifer Healey, Laurie Byrum, Md Nadeem Akhtar, Surabhi Bhargava, Moumita Sinha
05 May 2025 · 25 / 0 / 0

Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts
Hanhua Hong, Chenghao Xiao, Yang Wang, Y. Liu, Wenge Rong, Chenghua Lin
29 Apr 2025 · 21 / 0 / 0

LLMs as Data Annotators: How Close Are We to Human Performance
Muhammad Uzair Ul Haq, Davide Rigoni, A. Sperduti
21 Apr 2025 · 17 / 0 / 0

Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing
Jihyun Janice Ahn, Wenpeng Yin
Tags: SILM, LRM
02 Apr 2025 · 53 / 1 / 0

Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions
Yubo Li, Yidi Miao, Xueying Ding, Ramayya Krishnan, R. Padman
28 Mar 2025 · 32 / 0 / 0

GraphEval: A Lightweight Graph-Based LLM Framework for Idea Evaluation
Tao Feng, Yihang Sun, Jiaxuan You
16 Mar 2025 · 43 / 0 / 0

Adaptive Prompting: Ad-hoc Prompt Composition for Social Bias Detection
Maximilian Spliethöver, Tim Knebler, Fabian Fumagalli, Maximilian Muschalik, Barbara Hammer, Eyke Hüllermeier, Henning Wachsmuth
10 Feb 2025 · 94 / 1 / 0

Linguistic Features Extracted by GPT-4 Improve Alzheimer's Disease Detection based on Spontaneous Speech
Jonathan Heitz, Gerold Schneider, Nicolas Langer
Tags: LM&MA
20 Dec 2024 · 76 / 0 / 0

LLMs: A Game-Changer for Software Engineers?
Md Asraful Haque
Tags: LLMAG, SyDa
01 Nov 2024 · 21 / 0 / 0

Evaluating Gender Bias of LLMs in Making Morality Judgements
Divij Bajaj, Yuanyuan Lei, Jonathan Tong, Ruihong Huang
13 Oct 2024 · 26 / 1 / 0

Estimating Contribution Quality in Online Deliberations Using a Large Language Model
Lodewijk Gelauff, Mohak Goyal, Bhargav Dindukurthi, Ashish Goel, Alice Siu
21 Aug 2024 · 27 / 0 / 0

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text
Sher Badshah, Hassan Sajjad
Tags: ELM
17 Aug 2024 · 26 / 8 / 0

To Believe or Not to Believe Your LLM
Yasin Abbasi-Yadkori, Ilja Kuzborskij, András György, Csaba Szepesvári
Tags: UQCV
04 Jun 2024 · 53 / 14 / 0

AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents
Luca Gioacchini, G. Siracusano, D. Sanvito, Kiril Gashteovski, David Friede, Roberto Bifulco, Carolin (Haas) Lawrence
Tags: ELM, LLMAG
09 Apr 2024 · 36 / 10 / 0

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
Jiawei Liu, Chun Xia, Yuyao Wang, Lingming Zhang
Tags: ELM, ALM
02 May 2023 · 161 / 388 / 0

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, F. Xia, Ed H. Chi, Quoc Le, Denny Zhou
Tags: LM&Ro, LRM, AI4CE, ReLM
28 Jan 2022 · 313 / 8,261 / 0