Scalable agent alignment via reward modeling: a research direction

19 November 2018

Papers citing "Scalable agent alignment via reward modeling: a research direction"

28 / 78 papers shown

Defining and Characterizing Reward Hacking Joar Skalse Nikolaus H. R. Howe Dmitrii Krasheninnikov David M. Krueger 57 54 0 27 Sep 2022
Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans John J. Nay ELM AILaw 84 27 0 14 Sep 2022
The Alignment Problem from a Deep Learning Perspective Richard Ngo Lawrence Chan Sören Mindermann 52 181 0 30 Aug 2022
A Hazard Analysis Framework for Code Synthesis Large Language Models Heidy Khlaaf Pamela Mishkin Joshua Achiam Gretchen Krueger Miles Brundage ELM 17 28 0 25 Jul 2022
Self-critiquing models for assisting human evaluators William Saunders Catherine Yeh Jeff Wu Steven Bills Long Ouyang Jonathan Ward Jan Leike ALM ELM 27 279 0 12 Jun 2022
Reward Uncertainty for Exploration in Preference-based Reinforcement Learning Xinran Liang Katherine Shu Kimin Lee Pieter Abbeel 16 58 0 24 May 2022
Adversarial Training for High-Stakes Reliability Daniel M. Ziegler Seraphina Nix Lawrence Chan Tim Bauman Peter Schmidt-Nielsen ... Noa Nabeshima Benjamin Weinstein-Raun D. Haas Buck Shlegeris Nate Thomas AAML 30 59 0 03 May 2022
Counterfactual harm Jonathan G. Richens R. Beard Daniel H. Thompson 21 27 0 27 Apr 2022
Mind the gap: Challenges of deep learning approaches to Theory of Mind Jaan Aru Aqeel Labash Oriol Corcoll Raul Vicente 20 26 0 30 Mar 2022
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 308 11,915 0 04 Mar 2022
Myriad: a real-world testbed to bridge trajectory optimization and deep learning Nikolaus H. R. Howe Simon Dufort-Labbé Nitarshan Rajkumar Pierre-Luc Bacon 24 5 0 22 Feb 2022
Safe Deep RL in 3D Environments using Human Feedback Matthew Rahtz Vikrant Varma Ramana Kumar Zachary Kenton Shane Legg Jan Leike 24 4 0 20 Jan 2022
WebGPT: Browser-assisted question-answering with human feedback Reiichiro Nakano Jacob Hilton S. Balaji Jeff Wu Long Ouyang ... Gretchen Krueger Kevin Button Matthew Knight B. Chess John Schulman ALM RALM 43 1,195 0 17 Dec 2021
B-Pref: Benchmarking Preference-Based Reinforcement Learning Kimin Lee Laura M. Smith Anca Dragan Pieter Abbeel OffRL 27 92 0 04 Nov 2021
Recursively Summarizing Books with Human Feedback Jeff Wu Long Ouyang Daniel M. Ziegler Nisan Stiennon Ryan J. Lowe Jan Leike Paul Christiano ALM 21 294 0 22 Sep 2021
Offline Meta-Reinforcement Learning with Online Self-Supervision Vitchyr H. Pong Ashvin Nair Laura M. Smith Catherine Huang Sergey Levine OffRL 24 66 0 08 Jul 2021
Open Problems in Cooperative AI Allan Dafoe Edward Hughes Yoram Bachrach Tantum Collins Kevin R. McKee Joel Z. Leibo Kate Larson T. Graepel 21 199 0 15 Dec 2020
An overview of 11 proposals for building safe advanced AI Evan Hubinger AAML 6 23 0 04 Dec 2020
Inverse Constrained Reinforcement Learning Usman Anwar Shehryar Malik Alireza Aghasi Ali Ahmed 10 58 0 19 Nov 2020
Avoiding Tampering Incentives in Deep RL via Decoupled Approval J. Uesato Ramana Kumar Victoria Krakovna Tom Everitt Richard Ngo Shane Legg 21 14 0 17 Nov 2020
Learning to summarize from human feedback Nisan Stiennon Long Ouyang Jeff Wu Daniel M. Ziegler Ryan J. Lowe Chelsea Voss Alec Radford Dario Amodei Paul Christiano ALM 14 1,966 0 02 Sep 2020
AI Research Considerations for Human Existential Safety (ARCHES) Andrew Critch David M. Krueger 22 50 0 30 May 2020
AI safety: state of the field through quantitative lens Mislav Juric A. Sandic Mario Brčič 13 24 0 12 Feb 2020
SafeLife 1.0: Exploring Side Effects in Complex Environments Carroll L. Wainwright P. Eckersley 11 12 0 03 Dec 2019
AI safety via debate G. Irving Paul Christiano Dario Amodei 199 199 0 02 May 2018
Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks Guy Katz Clark W. Barrett D. Dill Kyle D. Julian Mykel Kochenderfer AAML 226 1,835 0 03 Feb 2017
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles Balaji Lakshminarayanan Alexander Pritzel Charles Blundell UQCV BDL 270 5,660 0 05 Dec 2016
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning Y. Gal Zoubin Ghahramani UQCV BDL 276 9,136 0 06 Jun 2015