Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning

20 October 2023
Lucas Weber, Elia Bruni, Dieuwke Hupkes
arXiv:2310.13486

Papers citing "Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning"

25 / 25 papers shown

MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages
Dieuwke Hupkes, Nikolay Bogoychev
14 Apr 2025

On the Consistency of Multilingual Context Utilization in Retrieval-Augmented Generation
Jirui Qi, Raquel Fernández, Arianna Bisazza
RALM · 01 Apr 2025

Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness
Tingchen Fu, Fazl Barez
AAML · 03 Mar 2025

Can LLMs Help Uncover Insights about LLMs? A Large-Scale, Evolving Literature Analysis of Frontier LLMs
Jungsoo Park, Junmo Kang, Gabriel Stanovsky, Alan Ritter
26 Feb 2025

SelfPrompt: Autonomously Evaluating LLM Robustness via Domain-Constrained Knowledge Guidelines and Refined Adversarial Prompts
Aihua Pei, Zehua Yang, Shunan Zhu, Ruoxi Cheng, Ju Jia
AAML · 01 Dec 2024

KGPA: Robustness Evaluation for Large Language Models via Cross-Domain Knowledge Graphs
Aihua Pei, Zehua Yang, Shunan Zhu, Ruoxi Cheng, Ju Jia, Lina Wang
16 Jun 2024

Quantifying Variance in Evaluation Benchmarks
Lovish Madaan, Aaditya K. Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pontus Stenetorp, Sharan Narang, Dieuwke Hupkes
14 Jun 2024

Efficient multi-prompt evaluation of LLMs
Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, Mikhail Yurochkin
27 May 2024

Is Temperature the Creativity Parameter of Large Language Models?
Max Peeperkorn, Tom Kouwenhoven, Daniel G. Brown, Anna K. Jordanous
01 May 2024

Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks
Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, James Bono
25 Apr 2024

Sample Design Engineering: An Empirical Study of What Makes Good Downstream Fine-Tuning Samples for LLMs
Biyang Guo, He Wang, Wenyilin Xiao, Hong Chen, Zhuxin Lee, Songqiao Han, Hailiang Huang
19 Apr 2024

From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency
Xenia Ohmer, Elia Bruni, Dieuwke Hupkes
AI4CE · 18 Apr 2024

tinyBenchmarks: evaluating LLMs with fewer examples
Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin
ELM · 22 Feb 2024

On Sensitivity of Learning with Limited Labelled Data to the Effects of Randomness: Impact of Interactions and Systematic Choices
Branislav Pecher, Ivan Srba, M. Bieliková
20 Feb 2024

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
Norah A. Alzahrani, H. A. Alyahya, Sultan Yazeed Alnumay, Muhtasim Tahmid, Shaykhah Alsubaie, ..., Saleh Soltan, Nathan Scales, Marie-Anne Lachaux, Samuel R. Bowman, Haidar Khan
ELM · 01 Feb 2024

MERA: A Comprehensive LLM Evaluation in Russian
Alena Fenogenova, Artem Chervyakov, Nikita Martynov, Anastasia Kozlova, Maria Tikhonova, ..., Nikita Savushkin, Polina Mikhailova, Denis Dimitrov, Alexander Panchenko, Sergey Markov
ELM · 09 Jan 2024

State of What Art? A Call for Multi-Prompt LLM Evaluation
Moran Mizrahi, Guy Kaplan, Daniel Malkin, Rotem Dror, Dafna Shahaf, Gabriel Stanovsky
ELM · 31 Dec 2023

WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models
Youssef Benchekroun, Megi Dervishi, Mark Ibrahim, Jean-Baptiste Gaya, Xavier Martinet, Grégoire Mialon, Thomas Scialom, Emmanuel Dupoux, Dieuwke Hupkes, Pascal Vincent
LRM · 27 Nov 2023

Measuring the Robustness of NLP Models to Domain Shifts
Nitay Calderon, Naveh Porat, Eyal Ben-David, Alexander Chapanin, Zorik Gekhman, Nadav Oved, Vitaly Shalumov, Roi Reichart
31 May 2023

State-of-the-art generalisation research in NLP: A taxonomy and review
Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Mikel Artetxe, Yanai Elazar, ..., Leila Khalatbari, Maria Ryskina, Rita Frieske, Ryan Cotterell, Zhijing Jin
06 Oct 2022

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM · ALM · 04 Mar 2022

PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts
Stephen H. Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, ..., Khalid Almubarak, Xiangru Tang, Dragomir R. Radev, Mike Tian-Jian Jiang, Alexander M. Rush
VLM · 02 Feb 2022

Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, Pontus Stenetorp
AILaw · LRM · 18 Apr 2021

Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets
Mor Geva, Yoav Goldberg, Jonathan Berant
21 Aug 2019

Hypothesis Only Baselines in Natural Language Inference
Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, Benjamin Van Durme
02 May 2018