Optimal Policies Tend to Seek Power

Neural Information Processing Systems (NeurIPS), 2019

3 December 2019

Alexander Matt Turner

Papers citing "Optimal Policies Tend to Seek Power"

Agentic Misalignment: How LLMs Could Be Insider Threats Aengus Lynch Benjamin Wright Caleb Larson Stuart Ritchie Sören Mindermann Ethan Perez Kevin K. Troy Evan Hubinger 104 31 0 05 Oct 2025
Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization Antoine Maier Aude Maier Tom David 76 0 0 03 Oct 2025
Estimating the Empowerment of Language Model Agents Jinyeop Song Jeff Gore Max Kleiman-Weiner 118 1 0 26 Sep 2025
HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants Benjamin Sturgeon Daniel Samuelson Jacob Haimes Jacy Reese Anthis 183 1 0 10 Sep 2025
Understanding Action Effects through Instrumental Empowerment in Multi-Agent Reinforcement Learning Ardian Selmonaj M. Strupl Oleg Szehr Alessandro Antonucci 115 0 0 21 Aug 2025
AI Testing Should Account for Sophisticated Strategic Behaviour Vojtěch Kovařík Eric Olav Chen Sami Petersen Alexis Ghersengorin Vincent Conitzer 105 1 0 19 Aug 2025
Model-Based Soft Maximization of Suitable Metrics of Long-Term Human Power Jobst Heitzig Ram Potham 105 0 0 31 Jul 2025
Misalignment or misuse? The AGI alignment tradeoff Philosophical Studies (Philos. Stud.), 2025 Max Hellrigel-Holderbaum Leonard Dung 234 2 0 04 Jun 2025
The Ultimate Test of Superintelligent AI Agents: Can an AI Balance Care and Control in Asymmetric Relationships? Djallel Bouneffouf Matthew D Riemer Kush R. Varshney 217 0 0 02 Jun 2025
Will artificial agents pursue power by default? Christian Tarsney 96 0 0 02 Jun 2025
Plasticity as the Mirror of Empowerment David Abel Michael Bowling André Barreto Will Dabney Shi Dong ... Doina Precup Jonathan Richens Mark Rowland Tom Schaul Satinder Singh AI4CE 341 2 0 15 May 2025
Societal Alignment Frameworks Can Improve LLM Alignment Karolina Stañczak Nicholas Meade Mehar Bhatia Hattie Zhou Konstantin Böttinger ... Timothy P. Lillicrap Ana Marasović Sylvie Delacroix Gillian K. Hadfield Siva Reddy 972 3 0 27 Feb 2025
Universal AI maximizes Variational Empowerment Artificial General Intelligence (AGI), 2025 Yusuke Hayashi Koichi Takahashi 142 0 0 20 Feb 2025
Learning to Assist Humans without Inferring Rewards Neural Information Processing Systems (NeurIPS), 2024 Vivek Myers Evan Ellis Sergey Levine Benjamin Eysenbach Anca Dragan 506 10 0 17 Jan 2025
Towards evaluations-based safety cases for AI scheming Mikita Balesni Marius Hobbhahn David Lindner Alexander Meinke Tomek Korbak ... Dan Braun Bilal Chughtai Owain Evans Daniel Kokotajlo Lucius Bushnaq ELM 237 21 0 29 Oct 2024
Potential-Based Intrinsic Motivation: Preserving Optimality With Complex, Non-Markovian Shaping Rewards Grant C. Forbes Leonardo Villalobos-Arias Jianxun Wang Arnav Jhala David L. Roberts 219 2 0 16 Oct 2024
On Goodhart's law, with an application to value alignment El-Mahdi El-Mhamdi Lê-Nguyên Hoang 103 4 0 12 Oct 2024
RL, but don't do anything I wouldn't do Conference on Uncertainty in Artificial Intelligence (UAI), 2024 Michael K. Cohen Marcus Hutter Yoshua Bengio Stuart J. Russell OffRL 156 2 0 08 Oct 2024
OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions Yu-Shin Huang Peter Just Krishna Narayanan Chao Tian 250 15 0 06 Oct 2024
Beyond Preferences in AI Alignment Philosophical Studies (Philos. Stud.), 2024 Tan Zhi-Xuan Micah Carroll Matija Franklin Hal Ashton 307 37 0 30 Aug 2024
Evaluating AI Evaluation: Perils and Prospects John Burden ELM 192 13 0 12 Jul 2024
Towards shutdownable agents via stochastic choice Elliott Thornley Alexander Roman Christos Ziakas Leyton Ho Louis Thomson 412 1 0 30 Jun 2024
Games of Knightian Uncertainty as AGI testbeds Spyridon Samothrakis Dennis J. N. J. Soemers Damian Machlanski 226 1 0 26 Jun 2024
The Benefits of Power Regularization in Cooperative Reinforcement Learning Michelle Li Michael Dennis 191 3 0 17 Jun 2024
Dishonesty in Helpful and Harmless Alignment Youcheng Huang Jingkun Tang Duanyu Feng Zheng Zhang Wenqiang Lei Jiancheng Lv Anthony G. Cohn LLMSV 275 4 0 04 Jun 2024
REvolve: Reward Evolution with Large Language Models using Human Feedback Rishi Hazra Alkis Sygkounas Andreas Persson Amy Loutfi Pedro Zuidberg Dos Martires 318 3 0 03 Jun 2024
Contestable AI needs Computational Argumentation International Conference on Principles of Knowledge Representation and Reasoning (KR), 2024 Francesco Leofante Hamed Ayoobi Adam Dejl Gabriel Freedman Deniz Gorur ... Anna Rapberger Fabrizio Russo Xiang Yin Dekai Zhang Francesca Toni 204 12 0 17 May 2024
OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety Chuang Liu Linhao Yu Jiaxuan Li Renren Jin Yufei Huang ... Tao Liu Jinwang Song Hongying Zan Sun Li Deyi Xiong ELM 291 13 0 18 Mar 2024
The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists Elliott Thornley 137 17 0 07 Mar 2024
Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking Cassidy Laidlaw Shivam Singhal Anca Dragan AAML 278 11 0 05 Mar 2024
Quantifying stability of non-power-seeking in artificial agents Evan Ryan Gunter Yevgeny Liokumovich Victoria Krakovna 255 2 0 07 Jan 2024
Measuring Value Alignment Fazl Barez Juil Sock 87 5 0 23 Dec 2023
Preventing Language Models From Hiding Their Reasoning Fabien Roger Ryan Greenblatt LRM 408 28 0 27 Oct 2023
A Review of the Evidence for Existential Risk from AI via Misaligned Power-Seeking Rose Hadshar 132 9 0 27 Oct 2023
Managing extreme AI risks amid rapid progress Yoshua Bengio Geoffrey Hinton Andrew Yao Dawn Song Pieter Abbeel ... Juil Sock Stuart J. Russell Daniel Kahneman J. Brauner Sören Mindermann 280 30 0 26 Oct 2023
Improving Generalization of Alignment with Human Preferences through Group Invariant Learning International Conference on Learning Representations (ICLR), 2023 Rui Zheng Wei Shen Yuan Hua Wenbin Lai Jiajun Sun ... Xiao Wang Haoran Huang Tao Gui Tao Gui Xuanjing Huang 257 22 0 18 Oct 2023
AI Systems of Concern Kayla Matteucci S. Avin Fazl Barez Seán Ó hÉigeartaigh 188 1 0 09 Oct 2023
Large Language Model Alignment: A Survey Shangda Wu Renren Jin Yufei Huang Chuang Liu Weilong Dong Zishan Guo Xinwei Wu Yan Liu Deyi Xiong LM&MA 320 273 0 26 Sep 2023
A Case for AI Safety via Law Jeffrey W. Johnston 214 1 0 31 Jul 2023
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback Stephen Casper Xander Davies Claudia Shi T. Gilbert Jérémy Scheurer ... Erdem Biyik Anca Dragan David M. Krueger Dorsa Sadigh Dylan Hadfield-Menell ALM OffRL 321 689 0 27 Jul 2023
Frontier AI Regulation: Managing Emerging Risks to Public Safety Markus Anderljung Joslyn Barnhart Anton Korinek Jade Leung Cullen O'Keefe ... Jonas Schuett Yonadav Shavit Divya Siddarth Robert F. Trager Kevin J. Wolf SILM 327 151 0 06 Jul 2023
Human Control: Definitions and Algorithms Conference on Uncertainty in Artificial Intelligence (UAI), 2023 Ryan Carey Tom Everitt 200 12 0 31 May 2023
Incentivizing honest performative predictions with proper scoring rules Conference on Uncertainty in Artificial Intelligence (UAI), 2023 Caspar Oesterheld Johannes Treutlein Emery Cooper Rubi Hudson 225 9 0 28 May 2023
Model evaluation for extreme risks Toby Shevlane Sebastian Farquhar Ben Garfinkel Mary Phuong Jess Whittlestone ... Vijay Bolina Jack Clark Yoshua Bengio Paul Christiano Allan Dafoe ELM 245 193 0 24 May 2023
Selection for short-term empowerment accelerates the evolution of homeostatic neural cellular automata Annual Conference on Genetic and Evolutionary Computation (GECCO), 2023 Caitlin Grasso Josh Bongard 110 3 0 24 May 2023
Power-seeking can be probable and predictive for trained agents Victoria Krakovna János Kramár TDI 128 20 0 13 Apr 2023
Eight Things to Know about Large Language Models Sam Bowman ALM 287 134 0 02 Apr 2023
Unifying Grokking and Double Descent Peter W. Battaglia David Raposo Kelsey 243 46 0 10 Mar 2023
Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards Social Science Research Network (SSRN), 2023 John J. Nay ELM AILaw 218 19 0 24 Jan 2023
Scaling Laws for Reward Model Overoptimization International Conference on Machine Learning (ICML), 2022 Leo Gao John Schulman Jacob Hilton ALM 289 751 0 19 Oct 2022