1906.01820
Cited By
Risks from Learned Optimization in Advanced Machine Learning Systems
5 June 2019
Evan Hubinger
Chris van Merwijk
Vladimir Mikulik
Joar Skalse
Scott Garrabrant
Papers citing
"Risks from Learned Optimization in Advanced Machine Learning Systems"
50 / 108 papers shown
Title
Self Rewarding Self Improving
Toby Simonds
Kevin Lopez
Akira Yoshiyama
Dominique Garmier
ReLM
ALM
LRM
38
0
0
12 May 2025
The Steganographic Potentials of Language Models
Artem Karpov
Tinuade Adeleke
Seong Hah Cho
Natalia Perez-Campanero
32
0
0
06 May 2025
An alignment safety case sketch based on debate
Marie Davidsen Buhl
Jacob Pfau
Benjamin Hilton
Geoffrey Irving
38
0
0
06 May 2025
AI Awareness
Xianrui Li
Haoyuan Shi
Rongwu Xu
Wei Xu
54
0
0
25 Apr 2025
Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity
HyunJin Kim
Xiaoyuan Yi
Jing Yao
Muhua Huang
Jinyeong Bak
James Evans
Xing Xie
44
0
0
08 Mar 2025
In-context learning of evolving data streams with tabular foundational models
Afonso Lourenço
João Gama
Eric P. Xing
Goreti Marreiros
61
0
0
24 Feb 2025
Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?
Yufei He
Yuexin Li
Jiaying Wu
Yuan Sui
Yulin Chen
Bryan Hooi
ALM
94
5
0
16 Feb 2025
A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks
Hieu Minh "Jord" Nguyen
LM&MA
LRM
54
0
0
10 Feb 2025
Barriers and Pathways to Human-AI Alignment: A Game-Theoretic Approach
Aran Nayebi
74
1
0
09 Feb 2025
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
Marc Carauleanu
Michael Vaiana
Judd Rosenblatt
Cameron Berg
Diogo Schwerz de Lucena
68
0
0
20 Dec 2024
Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models
Cameron Tice
Philipp Alexander Kreer
Nathan Helm-Burger
Prithviraj Singh Shahani
Fedor Ryzhenkov
Jacob Haimes
Felix Hofstätter
Teun van der Weij
82
1
0
02 Dec 2024
The Two-Hop Curse: LLMs trained on A→B, B→C fail to learn A→C
Mikita Balesni
Tomek Korbak
Owain Evans
ReLM
LRM
79
0
0
25 Nov 2024
Pretrained transformer efficiently learns low-dimensional target functions in-context
Kazusato Oko
Yujin Song
Taiji Suzuki
Denny Wu
39
4
0
04 Nov 2024
Towards evaluations-based safety cases for AI scheming
Mikita Balesni
Marius Hobbhahn
David Lindner
Alexander Meinke
Tomek Korbak
...
Dan Braun
Bilal Chughtai
Owain Evans
Daniel Kokotajlo
Lucius Bushnaq
ELM
44
9
0
29 Oct 2024
Estimating the Probabilities of Rare Outputs in Language Models
Gabriel Wu
Jacob Hilton
AAML
UQCV
42
2
0
17 Oct 2024
Neural networks that overcome classic challenges through practice
Kazuki Irie
Brenden M. Lake
34
4
0
14 Oct 2024
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Michael Lan
Philip H. S. Torr
Austin Meek
Ashkan Khakzar
David M. Krueger
Fazl Barez
43
10
0
09 Oct 2024
Towards Measuring Goal-Directedness in AI Systems
Dylan Xu
Juan-Pablo Rivera
24
3
0
07 Oct 2024
Backdoor defense, learnability and obfuscation
Paul Christiano
Jacob Hilton
Victor Lecomte
Mark Xu
AAML
18
1
0
04 Sep 2024
Beyond Preferences in AI Alignment
Tan Zhi-Xuan
Micah Carroll
Matija Franklin
Hal Ashton
38
16
0
30 Aug 2024
Evaluating Stability of Unreflective Alignment
James Lucassen
Mark Henry
Philippa Wright
Owen Yeung
36
0
0
27 Aug 2024
Personality Alignment of Large Language Models
Minjun Zhu
Linyi Yang
Yue Zhang
Yue Zhang
ALM
64
5
0
21 Aug 2024
On the Undecidability of Artificial Intelligence Alignment: Machines that Halt
Gabriel Adriano de Melo
Marcos Ricardo Omena de Albuquerque Máximo
Nei Yoshihiro Soma
Paulo Andre Lima de Castro
32
0
0
16 Aug 2024
On the Generalization of Preference Learning with DPO
Shawn Im
Yixuan Li
49
1
0
06 Aug 2024
Designing Time-Series Models With Hypernetworks & Adversarial Portfolios
Filip Stanek
AI4TS
31
0
0
29 Jul 2024
Evaluating AI Evaluation: Perils and Prospects
John Burden
ELM
33
8
0
12 Jul 2024
On scalable oversight with weak LLMs judging strong LLMs
Zachary Kenton
Noah Y. Siegel
János Kramár
Jonah Brown-Cohen
Samuel Albanie
...
Rishabh Agarwal
David Lindner
Yunhao Tang
Noah D. Goodman
Rohin Shah
ELM
43
29
0
05 Jul 2024
Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs
Sara Price
Arjun Panickssery
Sam Bowman
Asa Cooper Stickland
LLMSV
29
3
0
04 Jul 2024
Towards shutdownable agents via stochastic choice
Elliott Thornley
Alexander Roman
Christos Ziakas
Leyton Ho
Louis Thomson
38
0
0
30 Jun 2024
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Carson E. Denison
M. MacDiarmid
Fazl Barez
David Duvenaud
Shauna Kravec
...
Jared Kaplan
Buck Shlegeris
Samuel R. Bowman
Ethan Perez
Evan Hubinger
56
36
0
14 Jun 2024
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij
Felix Hofstätter
Ollie Jaffe
Samuel F. Brown
Francis Rhys Ward
ELM
45
23
0
11 Jun 2024
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Erik Jenner
Shreyas Kapur
Vasil Georgiev
Cameron Allen
Scott Emmons
Stuart J. Russell
32
10
0
02 Jun 2024
On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability
Chenyu Zheng
Wei Huang
Rongzheng Wang
Guoqiang Wu
Jun Zhu
Chongxuan Li
39
1
0
27 May 2024
From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks
Jacob Russin
Sam Whitman McGrath
Danielle J. Williams
Lotem Elber-Dorozko
AI4CE
73
3
0
24 May 2024
Skin-in-the-Game: Decision Making via Multi-Stakeholder Alignment in LLMs
Bilgehan Sel
Priya Shanmugasundaram
Mohammad Kachuee
Kun Zhou
Ruoxi Jia
Ming Jin
LRM
40
2
0
21 May 2024
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
David Dalrymple
Joar Skalse
Yoshua Bengio
Stuart J. Russell
Max Tegmark
...
Clark Barrett
Ding Zhao
Zhi-Xuan Tan
Jeannette Wing
Joshua Tenenbaum
52
52
0
10 May 2024
Generative AI in Cybersecurity
Shivani Metta
Isaac Chang
Jack Parker
Michael P. Roman
Arturo F. Ehuan
34
4
0
02 May 2024
Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Olli Järviniemi
Evan Hubinger
24
12
0
25 Apr 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
40
112
0
22 Apr 2024
Understanding the Learning Dynamics of Alignment with Human Feedback
Shawn Im
Yixuan Li
ALM
32
11
0
27 Mar 2024
SoFA: Shielded On-the-fly Alignment via Priority Rule Following
Xinyu Lu
Bowen Yu
Yaojie Lu
Hongyu Lin
Haiyang Yu
Le Sun
Xianpei Han
Yongbin Li
60
13
0
27 Feb 2024
Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts
Yuejiang Liu
Alexandre Alahi
31
18
0
23 Feb 2024
Linear Transformers are Versatile In-Context Learners
Max Vladymyrov
J. Oswald
Mark Sandler
Rong Ge
31
13
0
21 Feb 2024
Secret Collusion among Generative AI Agents: Multi-Agent Deception via Steganography
S. Motwani
Mikhail Baranchuk
Martin Strohmeier
Vijay Bolina
Philip H. S. Torr
Lewis Hammond
Christian Schroeder de Witt
40
4
0
12 Feb 2024
Quantifying stability of non-power-seeking in artificial agents
Evan Ryan Gunter
Yevgeny Liokumovich
Victoria Krakovna
29
1
0
07 Jan 2024
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Collin Burns
Pavel Izmailov
Jan Hendrik Kirchner
Bowen Baker
Leo Gao
...
Adrien Ecoffet
Manas Joglekar
Jan Leike
Ilya Sutskever
Jeff Wu
ELM
41
258
0
14 Dec 2023
Honesty Is the Best Policy: Defining and Mitigating AI Deception
Francis Rhys Ward
Francesco Belardinelli
Francesca Toni
Tom Everitt
110
27
0
03 Dec 2023
Eliciting Latent Knowledge from Quirky Language Models
Alex Troy Mallen
Madeline Brumley
Julia Kharchenko
Nora Belrose
HILM
RALM
KELM
19
25
0
02 Dec 2023
In-context Learning and Gradient Descent Revisited
Gilad Deutch
Nadav Magar
Tomer Bar Natan
Guy Dar
28
8
0
13 Nov 2023
An International Consortium for Evaluations of Societal-Scale Risks from Advanced AI
Ross Gruetzemacher
Alan Chan
Kevin Frazier
Christy Manning
Stepán Los
...
Clíodhna Ní Ghuidhir
Mark M. Bailey
Daniel Eth
Toby D. Pilditch
Kyle A. Kilian
24
5
0
22 Oct 2023