AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations
arXiv: 2406.18346
26 June 2024
Adam Dahlgren Lindström, Leila Methnani, Lea Krause, Petter Ericson, Íñigo Martínez de Rituerto de Troya, Dimitri Coelho Mollo, Roel Dobbe
Papers citing "AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations" (4 of 4 papers shown)
Towards Understanding Sycophancy in Language Models
Mrinank Sharma, Meg Tong, Tomasz Korbak, D. Duvenaud, Amanda Askell, ..., Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
20 Oct 2023
Understanding the Effects of RLHF on LLM Generalisation and Diversity
Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, Roberta Raileanu
10 Oct 2023
From plane crashes to algorithmic harm: applicability of safety engineering frameworks for responsible ML
Shalaleh Rismani, Renee Shelby, A. Smart, Edgar W. Jatho, Joshua A. Kroll, AJung Moon, Negar Rostamzadeh
06 Oct 2022
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
04 Mar 2022