arXiv: 2406.07971
It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF
12 June 2024
Taiming Lu, Lingfeng Shen, Xinyu Yang, Weiting Tan, Beidi Chen, Huaxiu Yao
Papers citing "It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF" (7 / 7 papers shown)

Visualising Policy-Reward Interplay to Inform Zeroth-Order Preference Optimisation of Large Language Models
Alessio Galatolo, Zhenbang Dai, Katie Winkle, Meriem Beloucif
05 Mar 2025 · 45 · 0 · 0

Sycophancy in Large Language Models: Causes and Mitigations
Lars Malmqvist
22 Nov 2024 · 56 · 6 · 0

Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, ..., John F. J. Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, G. Irving [ALM, AAML]
28 Sep 2022 · 225 · 495 · 0

Teaching language models to support answers with verified quotes
Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, ..., Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, G. Irving, Nat McAleese [ELM, RALM]
21 Mar 2022 · 220 · 255 · 0

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe [OSLM, ALM]
04 Mar 2022 · 301 · 11,730 · 0

Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving [ALM]
18 Sep 2019 · 275 · 1,561 · 0

Dialogue Learning With Human-In-The-Loop
Jiwei Li, Alexander H. Miller, S. Chopra, Marc'Aurelio Ranzato, Jason Weston [OffRL]
29 Nov 2016 · 216 · 132 · 0