Suppressing Pink Elephants with Direct Principle Feedback

12 February 2024

Papers citing "Suppressing Pink Elephants with Direct Principle Feedback"

7 / 7 papers shown

Title
Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements Jingyu Zhang Ahmed Elgohary Ahmed Magooda Daniel Khashabi Benjamin Van Durme 33 2 0 11 Oct 2024
KTO: Model Alignment as Prospect Theoretic Optimization Kawin Ethayarajh Winnie Xu Niklas Muennighoff Dan Jurafsky Douwe Kiela 150 437 0 02 Feb 2024
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models Avi Singh John D. Co-Reyes Rishabh Agarwal Ankesh Anand Piyush Patil ... Yamini Bansal Ethan Dyer Behnam Neyshabur Jascha Narain Sohl-Dickstein Noah Fiedel ALM LRM ReLM SyDa 144 143 0 11 Dec 2023
Who's Harry Potter? Approximate Unlearning in LLMs Ronen Eldan M. Russinovich MU MoMe 98 171 0 03 Oct 2023
Improving alignment of dialogue agents via targeted human judgements Amelia Glaese Nat McAleese Maja Trkebacz John Aslanides Vlad Firoiu ... John F. J. Mellor Demis Hassabis Koray Kavukcuoglu Lisa Anne Hendricks G. Irving ALM AAML 217 495 0 28 Sep 2022
Offline RL for Natural Language Generation with Implicit Language Q Learning Charles Burton Snell Ilya Kostrikov Yi Su Mengjiao Yang Sergey Levine OffRL 116 101 0 05 Jun 2022
Multitask Prompted Training Enables Zero-Shot Task Generalization Victor Sanh Albert Webson Colin Raffel Stephen H. Bach Lintang Sutawika ... T. Bers Stella Biderman Leo Gao Thomas Wolf Alexander M. Rush LRM 203 1,651 0 15 Oct 2021