ResearchTrend.AI · Papers · 1906.01820 · Cited By
Risks from Learned Optimization in Advanced Machine Learning Systems
5 June 2019 · Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant

Papers citing "Risks from Learned Optimization in Advanced Machine Learning Systems" (50 of 108 papers shown)
Getting aligned on representational alignment
  Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, …, Thomas Unterthiner, Andrew Kyle Lampinen, Klaus-Robert Müller, M. Toneva, Thomas L. Griffiths · 18 Oct 2023

Conceptual Framework for Autonomous Cognitive Entities
  David Shapiro, Wangfan Li, Manuel Delaflor, Carlos Toxtli · 03 Oct 2023

CoinRun: Solving Goal Misgeneralisation
  Stuart Armstrong, Alexandre Maranhao, Oliver Daniels-Koch, Ioannis Gkioulekas, Rebecca Gormann · LRM · 28 Sep 2023

Large Language Model Alignment: A Survey
  Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong · LM&MA · 26 Sep 2023

Uncovering mesa-optimization algorithms in Transformers
  J. Oswald, Eyvind Niklasson, Maximilian Schlegel, Seijin Kobayashi, Nicolas Zucchet, …, Mark Sandler, Blaise Agüera y Arcas, Max Vladymyrov, Razvan Pascanu, João Sacramento · 11 Sep 2023

Taken out of context: On measuring situational awareness in LLMs
  Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans · LLMAG, LRM · 01 Sep 2023

Deception Abilities Emerged in Large Language Models
  Thilo Hagendorff · LLMAG · 31 Jul 2023

Of Models and Tin Men: A Behavioural Economics Study of Principal-Agent Problems in AI Alignment using Large-Language Models
  S. Phelps, Rebecca E. Ranson · LLMAG · 20 Jul 2023

Deceptive Alignment Monitoring
  Andres Carranza, Dhruv Pai, Rylan Schaeffer, Arnuv Tandon, Oluwasanmi Koyejo · 20 Jul 2023

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
  Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, G. Irving, Rohin Shah, Vladimir Mikulik · 18 Jul 2023

Frontier AI Regulation: Managing Emerging Risks to Public Safety
  Markus Anderljung, Joslyn Barnhart, Anton Korinek, Jade Leung, Cullen O'Keefe, …, Jonas Schuett, Yonadav Shavit, Divya Siddarth, Robert F. Trager, Kevin J. Wolf · SILM · 06 Jul 2023

Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks
  B. Levinstein, Daniel A. Herrmann · 30 Jun 2023

Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression
  Allan Raventós, Mansheej Paul, F. Chen, Surya Ganguli · 26 Jun 2023

Inverse Scaling: When Bigger Isn't Better
  I. R. McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, …, Yuhui Zhang, Zhengping Zhou, Najoung Kim, Sam Bowman, Ethan Perez · 15 Jun 2023

Incentivizing honest performative predictions with proper scoring rules
  Caspar Oesterheld, Johannes Treutlein, Emery Cooper, Rubi Hudson · 28 May 2023

Deep Learning and Ethics
  Travis LaCroix, Simon J. D. Prince · FaML · 24 May 2023

Eight Things to Know about Large Language Models
  Sam Bowman · ALM · 02 Apr 2023

Natural Selection Favors AIs over Humans
  Dan Hendrycks · 28 Mar 2023

Conditioning Predictive Models: Risks and Strategies
  Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, Kate Woolverton · 02 Feb 2023

Discovering Language Model Behaviors with Model-Written Evaluations
  Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, …, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, Jared Kaplan · ALM · 19 Dec 2022

Transformers learn in-context by gradient descent
  J. Oswald, Eyvind Niklasson, E. Randazzo, João Sacramento, A. Mordvintsev, A. Zhmoginov, Max Vladymyrov · MLT · 15 Dec 2022

Misspecification in Inverse Reinforcement Learning
  Joar Skalse, Alessandro Abate · 06 Dec 2022

Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks
  Stephen Casper, K. Hariharan, Dylan Hadfield-Menell · AAML · 18 Nov 2022

Ignore Previous Prompt: Attack Techniques For Language Models
  Fábio Perez, Ian Ribeiro · SILM · 17 Nov 2022

Measuring Progress on Scalable Oversight for Large Language Models
  Sam Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, …, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Benjamin Mann, Jared Kaplan · ALM, ELM · 04 Nov 2022

Scaling Laws for Reward Model Overoptimization
  Leo Gao, John Schulman, Jacob Hilton · ALM · 19 Oct 2022

Bridging the Gap between Artificial Intelligence and Artificial General Intelligence: A Ten Commandment Framework for Human-Like Intelligence
  Ananta Nair, F. Kashani · 17 Oct 2022

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals
  Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, J. Uesato, Zachary Kenton · 04 Oct 2022

The Alignment Problem from a Deep Learning Perspective
  Richard Ngo, Lawrence Chan, Sören Mindermann · 30 Aug 2022

What Do NLP Researchers Believe? Results of the NLP Community Metasurvey
  Julian Michael, Ari Holtzman, Alicia Parrish, Aaron Mueller, Alex Jinpeng Wang, …, Divyam Madaan, Nikita Nangia, Richard Yuanzhe Pang, Jason Phang, Sam Bowman · 26 Aug 2022

The Linguistic Blind Spot of Value-Aligned Agency, Natural and Artificial
  Travis LaCroix · 02 Jul 2022

Parametrically Retargetable Decision-Makers Tend To Seek Power
  Alexander Matt Turner, Prasad Tadepalli · 27 Jun 2022

Worldwide AI Ethics: a review of 200 guidelines and recommendations for AI governance
  N. Corrêa, Camila Galvão, J. Santos, C. Pino, Edson Pontes Pinto, …, Diogo Massmann, Rodrigo Mambrini, Luiza Galvao, Edmund Terem, Nythamar Fernandes de Oliveira · 23 Jun 2022

Actionable Guidance for High-Consequence AI Risk Management: Towards Standards Addressing AI Catastrophic Risks
  Anthony M. Barrett, Dan Hendrycks, Jessica Newman, Brandie Nonnecke · SILM · 17 Jun 2022

Is Power-Seeking AI an Existential Risk?
  Joseph Carlsmith · ELM · 16 Jun 2022

X-Risk Analysis for AI Research
  Dan Hendrycks, Mantas Mazeika · 13 Jun 2022

Researching Alignment Research: Unsupervised Analysis
  Jan H. Kirchner, Logan Smith, Jacques Thibodeau, Kyle McDonell, Laria Reynolds · 06 Jun 2022

Adversarial Training for High-Stakes Reliability
  Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, …, Noa Nabeshima, Benjamin Weinstein-Raun, D. Haas, Buck Shlegeris, Nate Thomas · AAML · 03 May 2022

GPT-NeoX-20B: An Open-Source Autoregressive Language Model
  Sid Black, Stella Biderman, Eric Hallahan, Quentin G. Anthony, Leo Gao, …, Shivanshu Purohit, Laria Reynolds, J. Tow, Benqi Wang, Samuel Weinbach · 14 Apr 2022

A Modern Self-Referential Weight Matrix That Learns to Modify Itself
  Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber · 11 Feb 2022

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
  Alexander Pan, Kush S. Bhatia, Jacob Steinhardt · 10 Jan 2022

The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail
  Sam Bowman · OffRL · 15 Oct 2021

Extending Environments To Measure Self-Reflection In Reinforcement Learning
  S. Alexander, Michael Castaneda, K. Compher, Oscar Martinez · 13 Oct 2021

Unsolved Problems in ML Safety
  Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt · 28 Sep 2021

Goal Misgeneralization in Deep Reinforcement Learning
  L. Langosco, Jack Koch, Lee D. Sharkey, J. Pfau, Laurent Orseau, David M. Krueger · 28 May 2021

Alignment of Language Agents
  Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, G. Irving · 26 Mar 2021

An overview of 11 proposals for building safe advanced AI
  Evan Hubinger · AAML · 04 Dec 2020

Avoiding Tampering Incentives in Deep RL via Decoupled Approval
  J. Uesato, Ramana Kumar, Victoria Krakovna, Tom Everitt, Richard Ngo, Shane Legg · 17 Nov 2020

REALab: An Embedded Perspective on Tampering
  Ramana Kumar, J. Uesato, Richard Ngo, Tom Everitt, Victoria Krakovna, Shane Legg · 17 Nov 2020

Achilles Heels for AGI/ASI via Decision Theoretic Adversaries
  Stephen L. Casper · 12 Oct 2020