Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2205.01663
Cited By
Adversarial Training for High-Stakes Reliability
3 May 2022
Daniel M. Ziegler
Seraphina Nix
Lawrence Chan
Tim Bauman
Peter Schmidt-Nielsen
Tao R. Lin
Adam Scherlis
Noa Nabeshima
Benjamin Weinstein-Raun
D. Haas
Buck Shlegeris
Nate Thomas
AAML
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Adversarial Training for High-Stakes Reliability"
14 / 14 papers shown
Title
Endless Jailbreaks with Bijection Learning
Brian R. Y. Huang
Maximilian Li
Leonard Tang
AAML
73
5
0
02 Oct 2024
Black-Box Access is Insufficient for Rigorous AI Audits
Stephen Casper
Carson Ezell
Charlotte Siegmann
Noam Kolt
Taylor Lynn Curtis
...
Michael Gerovitch
David Bau
Max Tegmark
David M. Krueger
Dylan Hadfield-Menell
AAML
13
76
0
25 Jan 2024
ALMANACS: A Simulatability Benchmark for Language Model Explainability
Edmund Mills
Shiye Su
Stuart J. Russell
Scott Emmons
43
7
0
20 Dec 2023
AI Control: Improving Safety Despite Intentional Subversion
Ryan Greenblatt
Buck Shlegeris
Kshitij Sachan
Fabien Roger
22
38
0
12 Dec 2023
Auditing large language models: a three-layered approach
Jakob Mokander
Jonas Schuett
Hannah Rose Kirk
Luciano Floridi
AILaw
MLAU
29
193
0
16 Feb 2023
The Alignment Problem from a Deep Learning Perspective
Richard Ngo
Lawrence Chan
Sören Mindermann
49
181
0
30 Aug 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
303
11,881
0
04 Mar 2022
Analyzing Dynamic Adversarial Training Data in the Limit
Eric Wallace
Adina Williams
Robin Jia
Douwe Kiela
184
29
0
16 Oct 2021
Robust Feature-Level Adversaries are Interpretability Tools
Stephen Casper
Max Nadeau
Dylan Hadfield-Menell
Gabriel Kreiman
AAML
40
27
0
07 Oct 2021
Unsolved Problems in ML Safety
Dan Hendrycks
Nicholas Carlini
John Schulman
Jacob Steinhardt
173
272
0
28 Sep 2021
Tailor: Generating and Perturbing Text with Semantic Controls
Alexis Ross
Tongshuang Wu
Hao Peng
Matthew E. Peters
Matt Gardner
134
77
0
15 Jul 2021
Gradient-based Adversarial Attacks against Text Transformers
Chuan Guo
Alexandre Sablayrolles
Hervé Jégou
Douwe Kiela
SILM
98
227
0
15 Apr 2021
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
275
1,583
0
18 Sep 2019
AI safety via debate
G. Irving
Paul Christiano
Dario Amodei
199
199
0
02 May 2018
1