ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals
arXiv:2210.01790 · 4 October 2022
Rohin Shah
Vikrant Varma
Ramana Kumar
Mary Phuong
Victoria Krakovna
J. Uesato
Zachary Kenton

Papers citing "Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals"

15 of 15 citing papers shown:

1. An alignment safety case sketch based on debate
   Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, Geoffrey Irving · 06 May 2025 · 36 / 0 / 0

2. Estimating the Probabilities of Rare Outputs in Language Models
   Gabriel Wu, Jacob Hilton [AAML, UQCV] · 17 Oct 2024 · 40 / 2 / 0

3. Towards shutdownable agents via stochastic choice
   Elliott Thornley, Alexander Roman, Christos Ziakas, Leyton Ho, Louis Thomson · 30 Jun 2024 · 35 / 0 / 0

4. AI Sandbagging: Language Models can Strategically Underperform on Evaluations
   Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward [ELM] · 11 Jun 2024 · 37 / 23 / 0

5. Open-Endedness is Essential for Artificial Superhuman Intelligence
   Edward Hughes, Michael Dennis, Jack Parker-Holder, Feryal M. P. Behbahani, Aditi Mavalankar, Yuge Shi, Tom Schaul, Tim Rocktaschel [LRM] · 06 Jun 2024 · 32 / 18 / 0

6. Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks
   Rahul Ramesh, Ekdeep Singh Lubana, Mikail Khona, Robert P. Dick, Hidenori Tanaka [CoGe] · 21 Nov 2023 · 33 / 6 / 0

7. A Review of the Evidence for Existential Risk from AI via Misaligned Power-Seeking
   Rose Hadshar · 27 Oct 2023 · 18 / 6 / 0

8. Language Reward Modulation for Pretraining Reinforcement Learning
   Ademi Adeniji, Amber Xie, Carmelo Sferrazza, Younggyo Seo, Stephen James, Pieter Abbeel · 23 Aug 2023 · 39 / 26 / 0

9. Power-seeking can be probable and predictive for trained agents
   Victoria Krakovna, János Kramár [TDI] · 13 Apr 2023 · 27 / 16 / 0

10. Adversarial Cheap Talk
    Chris Xiaoxuan Lu, Timon Willi, Alistair Letcher, Jakob N. Foerster [AAML] · 20 Nov 2022 · 16 / 17 / 0

11. The Alignment Problem from a Deep Learning Perspective
    Richard Ngo, Lawrence Chan, Sören Mindermann · 30 Aug 2022 · 52 / 181 / 0

12. Training language models to follow instructions with human feedback
    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe [OSLM, ALM] · 04 Mar 2022 · 311 / 11,915 / 0

13. AI safety via debate
    G. Irving, Paul Christiano, Dario Amodei · 02 May 2018 · 201 / 199 / 0

14. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
    Balaji Lakshminarayanan, Alexander Pritzel, Charles Blundell [UQCV, BDL] · 05 Dec 2016 · 270 / 5,660 / 0

15. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
    N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang [ODL] · 15 Sep 2016 · 278 / 2,888 / 0