Optimal Policies Tend to Seek Power

Neural Information Processing Systems (NeurIPS), 2019

3 December 2019

Alexander Matt Turner

Papers citing "Optimal Policies Tend to Seek Power"

50 / 65 papers shown

Password-Activated Shutdown Protocols for Misaligned Frontier Agents

Kai Williams

Rohan Subramani

Francis Rhys Ward

29 Nov 2025

Instrumental goals in advanced AI systems: Features to be managed and not failures to be eliminated?

Willem Fourie

29 Oct 2025

Agentic Misalignment: How LLMs Could Be Insider Threats

157

05 Oct 2025

Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization

Antoine Maier

Aude Maier

Tom David

03 Oct 2025

Estimating the Empowerment of Language Model Agents

Jinyeop Song

Jeff Gore

Max Kleiman-Weiner

134

26 Sep 2025

HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants

207

10 Sep 2025

Understanding Action Effects through Instrumental Empowerment in Multi-Agent Reinforcement Learning

151

21 Aug 2025

AI Testing Should Account for Sophisticated Strategic Behaviour

133

19 Aug 2025

Model-Based Soft Maximization of Suitable Metrics of Long-Term Human Power

Jobst Heitzig

Ram Potham

153

31 Jul 2025

Misalignment or misuse? The AGI alignment tradeoff

Philosophical Studies (Philos. Stud.), 2025

Max Hellrigel-Holderbaum

Leonard Dung

277

04 Jun 2025

The Ultimate Test of Superintelligent AI Agents: Can an AI Balance Care and Control in Asymmetric Relationships?

Djallel Bouneffouf

Matthew D Riemer

Kush R. Varshney

253

02 Jun 2025

Will artificial agents pursue power by default?

Christian Tarsney

117

02 Jun 2025

Plasticity as the Mirror of Empowerment

...

414

15 May 2025

Societal Alignment Frameworks Can Improve LLM Alignment

...

1.0K

27 Feb 2025

Universal AI maximizes Variational Empowerment

Artificial General Intelligence (AGI), 2025

Yusuke Hayashi

Koichi Takahashi

145

20 Feb 2025

Learning to Assist Humans without Inferring Rewards

Neural Information Processing Systems (NeurIPS), 2024

567

17 Jan 2025

Active Inference and Human–Computer Interaction

175

19 Dec 2024

Towards evaluations-based safety cases for AI scheming

...

261

29 Oct 2024

Potential-Based Intrinsic Motivation: Preserving Optimality With Complex, Non-Markovian Shaping Rewards

Grant C. Forbes

Leonardo Villalobos-Arias

Jianxun Wang

Arnav Jhala

David L. Roberts

263

16 Oct 2024

On Goodhart's law, with an application to value alignment

El-Mahdi El-Mhamdi

Lê-Nguyên Hoang

152

12 Oct 2024

RL, but don't do anything I wouldn't do

Conference on Uncertainty in Artificial Intelligence (UAI), 2024

Michael K. Cohen

180

08 Oct 2024

OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions

270

06 Oct 2024

Beyond Preferences in AI Alignment

Philosophical Studies (Philos. Stud.), 2024

Tan Zhi-Xuan

Micah Carroll

Matija Franklin

Hal Ashton

343

30 Aug 2024

Evaluating AI Evaluation: Perils and Prospects

John Burden

ELM

220

12 Jul 2024

Towards shutdownable agents via stochastic choice

456

30 Jun 2024

Games of Knightian Uncertainty as AGI testbeds

Spyridon Samothrakis

Dennis J. N. J. Soemers

Damian Machlanski

282

26 Jun 2024

The Benefits of Power Regularization in Cooperative Reinforcement Learning

Michelle Li

Michael Dennis

224

17 Jun 2024

Dishonesty in Helpful and Harmless Alignment

303

04 Jun 2024

REvolve: Reward Evolution with Large Language Models using Human Feedback

Pedro Zuidberg Dos Martires

360

03 Jun 2024

Contestable AI needs Computational Argumentation

International Conference on Principles of Knowledge Representation and Reasoning (KR), 2024

...

Xiang Yin

Francesca Toni

263

17 May 2024

OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety

...

327

18 Mar 2024

The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists

Elliott Thornley

157

07 Mar 2024

Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking

375

05 Mar 2024

Quantifying stability of non-power-seeking in artificial agents

Evan Ryan Gunter

Yevgeny Liokumovich

Victoria Krakovna

287

07 Jan 2024

Measuring Value Alignment

Fazl Barez

Juil Sock

109

23 Dec 2023

Preventing Language Models From Hiding Their Reasoning

Fabien Roger

Ryan Greenblatt

LRM

448

27 Oct 2023

A Review of the Evidence for Existential Risk from AI via Misaligned Power-Seeking

Rose Hadshar

166

27 Oct 2023

Managing extreme AI risks amid rapid progress

...

340

26 Oct 2023

Improving Generalization of Alignment with Human Preferences through Group Invariant Learning

International Conference on Learning Representations (ICLR), 2023

Wei Shen

...

Xuanjing Huang

285

18 Oct 2023

AI Systems of Concern

205

09 Oct 2023

Large Language Model Alignment: A Survey

356

278

26 Sep 2023

A Case for AI Safety via Law

Jeffrey W. Johnston

258

31 Jul 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

...

Dorsa Sadigh

Dylan Hadfield-Menell

ALM OffRL

354

712

27 Jul 2023

Frontier AI Regulation: Managing Emerging Risks to Public Safety

...

Divya Siddarth

425

154

06 Jul 2023

Human Control: Definitions and Algorithms

Conference on Uncertainty in Artificial Intelligence (UAI), 2023

Ryan Carey

Tom Everitt

216

31 May 2023

Incentivizing honest performative predictions with proper scoring rules

Conference on Uncertainty in Artificial Intelligence (UAI), 2023

278

28 May 2023

Model evaluation for extreme risks

...

289

195

24 May 2023

Selection for short-term empowerment accelerates the evolution of homeostatic neural cellular automata

Annual Conference on Genetic and Evolutionary Computation (GECCO), 2023

Caitlin Grasso

Josh Bongard

138

24 May 2023

Power-seeking can be probable and predictive for trained agents

Victoria Krakovna

János Kramár

TDI

154

13 Apr 2023

Eight Things to Know about Large Language Models

Sam Bowman

ALM

311

139

02 Apr 2023