Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
23 August 2022 · arXiv:2209.07858
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Benjamin Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom B. Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, Jack Clark

Papers citing "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned" (5 of 5 papers shown)

Training language models to follow instructions with human feedback (04 Mar 2022)
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe

Analyzing Dynamic Adversarial Training Data in the Limit (16 Oct 2021)
Eric Wallace, Adina Williams, Robin Jia, Douwe Kiela

Challenges in Detoxifying Language Models (15 Sep 2021)
Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John F. J. Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, Po-Sen Huang

Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models (04 Feb 2021)
Alex Tamkin, Miles Brundage, Jack Clark, Deep Ganguli

Extracting Training Data from Large Language Models (14 Dec 2020)
Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, ..., Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, Colin Raffel