Safe RLHF: Safe Reinforcement Learning from Human Feedback

19 October 2023
Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang

Papers citing "Safe RLHF: Safe Reinforcement Learning from Human Feedback"

10 of 60 citing papers shown.

Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content
Federico Bianchi, James Y. Zou
21 Feb 2024

RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models
Jiong Wang, Junlin Wu, Muhao Chen, Yevgeniy Vorobeychik, Chaowei Xiao
16 Nov 2023 · AAML

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu
12 Aug 2023 · SILM

Can Large Language Models Be an Alternative to Human Evaluations?
Cheng-Han Chiang, Hung-yi Lee
03 May 2023 · ALM, LM&MA

Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, ..., John F. J. Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, G. Irving
28 Sep 2022 · ALM, AAML

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, John Kernion, Amanda Askell, Yuntao Bai, ..., Nicholas Joseph, Sam McCandlish, C. Olah, Jared Kaplan, Jack Clark
23 Aug 2022

An Empirical Survey on Long Document Summarization: Datasets, Models and Metrics
Huan Yee Koh, Jiaxin Ju, Ming Liu, Shirui Pan
03 Jul 2022

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
04 Mar 2022 · OSLM, ALM

Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, ..., T. Bers, Stella Biderman, Leo Gao, Thomas Wolf, Alexander M. Rush
15 Oct 2021 · LRM

AI safety via debate
G. Irving, Paul Christiano, Dario Amodei
02 May 2018