arXiv:1811.07871
Scalable agent alignment via reward modeling: a research direction
19 November 2018
Jan Leike
David M. Krueger
Tom Everitt
Miljan Martic
Vishal Maini
Shane Legg
Papers citing "Scalable agent alignment via reward modeling: a research direction"
28 / 78 papers shown
Defining and Characterizing Reward Hacking
Joar Skalse
Nikolaus H. R. Howe
Dmitrii Krasheninnikov
David M. Krueger
57
54
0
27 Sep 2022
Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans
John J. Nay
ELM
AILaw
84
27
0
14 Sep 2022
The Alignment Problem from a Deep Learning Perspective
Richard Ngo
Lawrence Chan
Sören Mindermann
52
181
0
30 Aug 2022
A Hazard Analysis Framework for Code Synthesis Large Language Models
Heidy Khlaaf
Pamela Mishkin
Joshua Achiam
Gretchen Krueger
Miles Brundage
ELM
17
28
0
25 Jul 2022
Self-critiquing models for assisting human evaluators
William Saunders
Catherine Yeh
Jeff Wu
Steven Bills
Long Ouyang
Jonathan Ward
Jan Leike
ALM
ELM
27
279
0
12 Jun 2022
Reward Uncertainty for Exploration in Preference-based Reinforcement Learning
Xinran Liang
Katherine Shu
Kimin Lee
Pieter Abbeel
16
58
0
24 May 2022
Adversarial Training for High-Stakes Reliability
Daniel M. Ziegler
Seraphina Nix
Lawrence Chan
Tim Bauman
Peter Schmidt-Nielsen
...
Noa Nabeshima
Benjamin Weinstein-Raun
D. Haas
Buck Shlegeris
Nate Thomas
AAML
30
59
0
03 May 2022
Counterfactual harm
Jonathan G. Richens
R. Beard
Daniel H. Thompson
21
27
0
27 Apr 2022
Mind the gap: Challenges of deep learning approaches to Theory of Mind
Jaan Aru
Aqeel Labash
Oriol Corcoll
Raul Vicente
20
26
0
30 Mar 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
308
11,915
0
04 Mar 2022
Myriad: a real-world testbed to bridge trajectory optimization and deep learning
Nikolaus H. R. Howe
Simon Dufort-Labbé
Nitarshan Rajkumar
Pierre-Luc Bacon
24
5
0
22 Feb 2022
Safe Deep RL in 3D Environments using Human Feedback
Matthew Rahtz
Vikrant Varma
Ramana Kumar
Zachary Kenton
Shane Legg
Jan Leike
24
4
0
20 Jan 2022
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano
Jacob Hilton
S. Balaji
Jeff Wu
Long Ouyang
...
Gretchen Krueger
Kevin Button
Matthew Knight
B. Chess
John Schulman
ALM
RALM
43
1,195
0
17 Dec 2021
B-Pref: Benchmarking Preference-Based Reinforcement Learning
Kimin Lee
Laura M. Smith
Anca Dragan
Pieter Abbeel
OffRL
27
92
0
04 Nov 2021
Recursively Summarizing Books with Human Feedback
Jeff Wu
Long Ouyang
Daniel M. Ziegler
Nisan Stiennon
Ryan J. Lowe
Jan Leike
Paul Christiano
ALM
21
294
0
22 Sep 2021
Offline Meta-Reinforcement Learning with Online Self-Supervision
Vitchyr H. Pong
Ashvin Nair
Laura M. Smith
Catherine Huang
Sergey Levine
OffRL
24
66
0
08 Jul 2021
Open Problems in Cooperative AI
Allan Dafoe
Edward Hughes
Yoram Bachrach
Tantum Collins
Kevin R. McKee
Joel Z. Leibo
Kate Larson
T. Graepel
21
199
0
15 Dec 2020
An overview of 11 proposals for building safe advanced AI
Evan Hubinger
AAML
6
23
0
04 Dec 2020
Inverse Constrained Reinforcement Learning
Usman Anwar
Shehryar Malik
Alireza Aghasi
Ali Ahmed
10
58
0
19 Nov 2020
Avoiding Tampering Incentives in Deep RL via Decoupled Approval
J. Uesato
Ramana Kumar
Victoria Krakovna
Tom Everitt
Richard Ngo
Shane Legg
21
14
0
17 Nov 2020
Learning to summarize from human feedback
Nisan Stiennon
Long Ouyang
Jeff Wu
Daniel M. Ziegler
Ryan J. Lowe
Chelsea Voss
Alec Radford
Dario Amodei
Paul Christiano
ALM
14
1,966
0
02 Sep 2020
AI Research Considerations for Human Existential Safety (ARCHES)
Andrew Critch
David M. Krueger
22
50
0
30 May 2020
AI safety: state of the field through quantitative lens
Mislav Juric
A. Sandic
Mario Brčič
13
24
0
12 Feb 2020
SafeLife 1.0: Exploring Side Effects in Complex Environments
Carroll L. Wainwright
P. Eckersley
11
12
0
03 Dec 2019
AI safety via debate
G. Irving
Paul Christiano
Dario Amodei
199
199
0
02 May 2018
Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks
Guy Katz
Clark W. Barrett
D. Dill
Kyle D. Julian
Mykel Kochenderfer
AAML
226
1,835
0
03 Feb 2017
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
Balaji Lakshminarayanan
Alexander Pritzel
Charles Blundell
UQCV
BDL
270
5,660
0
05 Dec 2016
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
Y. Gal
Zoubin Ghahramani
UQCV
BDL
276
9,136
0
06 Jun 2015