ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals
arXiv:2210.01790 · 4 October 2022
Rohin Shah
Vikrant Varma
Ramana Kumar
Mary Phuong
Victoria Krakovna
J. Uesato
Zachary Kenton

Papers citing "Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals"

15 of 15 citing papers shown:

1. An alignment safety case sketch based on debate
   Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, Geoffrey Irving · 06 May 2025 · 36 / 0 / 0

2. Estimating the Probabilities of Rare Outputs in Language Models
   Gabriel Wu, Jacob Hilton [AAML, UQCV] · 17 Oct 2024 · 40 / 2 / 0

3. Towards shutdownable agents via stochastic choice
   Elliott Thornley, Alexander Roman, Christos Ziakas, Leyton Ho, Louis Thomson · 30 Jun 2024 · 35 / 0 / 0

4. AI Sandbagging: Language Models can Strategically Underperform on Evaluations
   Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward [ELM] · 11 Jun 2024 · 37 / 23 / 0

5. Open-Endedness is Essential for Artificial Superhuman Intelligence
   Edward Hughes, Michael Dennis, Jack Parker-Holder, Feryal M. P. Behbahani, Aditi Mavalankar, Yuge Shi, Tom Schaul, Tim Rocktaschel [LRM] · 06 Jun 2024 · 32 / 18 / 0

6. Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks
   Rahul Ramesh, Ekdeep Singh Lubana, Mikail Khona, Robert P. Dick, Hidenori Tanaka [CoGe] · 21 Nov 2023 · 33 / 6 / 0

7. A Review of the Evidence for Existential Risk from AI via Misaligned Power-Seeking
   Rose Hadshar · 27 Oct 2023 · 18 / 6 / 0

8. Language Reward Modulation for Pretraining Reinforcement Learning
   Ademi Adeniji, Amber Xie, Carmelo Sferrazza, Younggyo Seo, Stephen James, Pieter Abbeel · 23 Aug 2023 · 39 / 26 / 0

9. Power-seeking can be probable and predictive for trained agents
   Victoria Krakovna, János Kramár [TDI] · 13 Apr 2023 · 27 / 16 / 0

10. Adversarial Cheap Talk
    Chris Xiaoxuan Lu, Timon Willi, Alistair Letcher, Jakob N. Foerster [AAML] · 20 Nov 2022 · 16 / 17 / 0

11. The Alignment Problem from a Deep Learning Perspective
    Richard Ngo, Lawrence Chan, Sören Mindermann · 30 Aug 2022 · 52 / 181 / 0

12. Training language models to follow instructions with human feedback
    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe [OSLM, ALM] · 04 Mar 2022 · 311 / 11,915 / 0

13. AI safety via debate
    G. Irving, Paul Christiano, Dario Amodei · 02 May 2018 · 201 / 199 / 0

14. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
    Balaji Lakshminarayanan, Alexander Pritzel, Charles Blundell [UQCV, BDL] · 05 Dec 2016 · 270 / 5,660 / 0

15. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
    N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang [ODL] · 15 Sep 2016 · 278 / 2,888 / 0