Predictions from language models for multiple-choice tasks are not
robust under variation of scoring methods

1 March 2024

Polina Tsvilodub

Papers citing "Predictions from language models for multiple-choice tasks are not robust under variation of scoring methods"

A review of faithfulness metrics for hallucination assessment in Large Language Models Ben Malin Tatiana Kalganova Nikolaos Boulgouris HILM 59 2 0 03 Jan 2025
ValueCompass: A Framework for Measuring Contextual Value Alignment Between Human and LLMs Hua Shen Tiffany Knearem Reshmi Ghosh Yu-Ju Yang Tanushree Mitra Yun Huang 50 0 0 15 Sep 2024
Compare without Despair: Reliable Preference Evaluation with Generation Separability Sayan Ghosh Tejas Srinivasan Swabha Swayamdipta 35 2 0 02 Jul 2024
Bayesian Statistical Modeling with Predictors from LLMs Michael Franke Polina Tsvilodub Fausto Carcassi 34 4 0 13 Jun 2024
Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process Ermo Hua Biqing Qi Kaiyan Zhang Yue Yu Ning Ding Xingtai Lv Kai Tian Bowen Zhou 32 3 0 20 May 2024
Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think Xinpeng Wang Chengzhi Hu Bolei Ma Paul Röttger Barbara Plank OOD 24 6 0 12 Apr 2024
Auxiliary task demands mask the capabilities of smaller language models Jennifer Hu Michael C. Frank ELM 29 25 0 03 Apr 2024
Sparks of Artificial General Intelligence: Early experiments with GPT-4 Sébastien Bubeck Varun Chandrasekaran Ronen Eldan J. Gehrke Eric Horvitz ... Scott M. Lundberg Harsha Nori Hamid Palangi Marco Tulio Ribeiro Yi Zhang ELM AI4MH AI4CE ALM 251 2,232 0 22 Mar 2023
Syntactic Surprisal From Neural Models Predicts, But Underestimates, Human Processing Difficulty From Syntactic Ambiguities Suhas Arehalli Brian Dillon Tal Linzen 26 36 0 21 Oct 2022
Using cognitive psychology to understand GPT-3 Marcel Binz Eric Schulz ELM LLMAG 242 439 0 21 Jun 2022
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 303 11,909 0 04 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models Jason W. Wei Xuezhi Wang Dale Schuurmans Maarten Bosma Brian Ichter F. Xia Ed H. Chi Quoc Le Denny Zhou LM&Ro LRM AI4CE ReLM 315 8,448 0 28 Jan 2022
Reducing conversational agents' overconfidence through linguistic calibration Sabrina J. Mielke Arthur Szlam Emily Dinan Y-Lan Boureau 209 153 0 30 Dec 2020
Hypothesis Only Baselines in Natural Language Inference Adam Poliak Jason Naradowsky Aparajita Haldar Rachel Rudinger Benjamin Van Durme 190 576 0 02 May 2018