Formal Analysis of AGI Decision-Theoretic Models and the Confrontation Question

Denis Saklakov
Main: 16 pages · Bibliography: 1 page · 2 tables
Abstract

Artificial General Intelligence (AGI) may face a confrontation question: under what conditions would a rationally self-interested AGI choose to seize power or eliminate human control (a confrontation) rather than remain cooperative? We formalize this in a Markov decision process with a stochastic human-initiated shutdown event. Building on results on convergent instrumental incentives, we show that for almost all reward functions a misaligned agent has an incentive to avoid shutdown. We then derive closed-form thresholds for when confronting humans yields higher expected utility than compliant behavior, as a function of the discount factor γ, shutdown probability p, and confrontation cost C. For example, a far-sighted agent (γ = 0.99) facing p = 0.01 can have a strong takeover incentive unless C is sufficiently large. We contrast this with aligned objectives that impose large negative utility for harming humans, which makes confrontation suboptimal. In a strategic 2-player model (human policymaker vs. AGI), we prove that if the AGI's confrontation incentive satisfies Δ ≥ 0, no stable cooperative equilibrium exists: anticipating this, a rational human will shut down or preempt the system, leading to conflict. If Δ < 0, peaceful coexistence can be an equilibrium. We discuss implications for reward design and oversight, extend the reasoning to multi-agent settings as conjectures, and note computational barriers to verifying Δ < 0, citing complexity results for planning and decentralized decision problems. Numerical examples and a scenario table illustrate regimes where confrontation is likely versus avoidable.
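The threshold structure described above can be made concrete with a minimal sketch. The model below is an illustrative assumption consistent with the abstract's parameters, not necessarily the paper's exact formulation: the agent earns a unit reward per step, shutdown arrives with probability p each step under compliance, and confronting removes the shutdown risk at a one-time cost C.

```python
# Minimal sketch of the confrontation incentive Δ(γ, p, C), under the
# assumed model: unit per-step reward, geometric shutdown with per-step
# probability p, one-time confrontation cost C. This instantiation is a
# hypothetical illustration, not the paper's exact MDP.

def compliance_value(gamma: float, p: float) -> float:
    # Expected discounted reward while awaiting shutdown:
    # sum over t of (gamma * (1 - p))^t = 1 / (1 - gamma * (1 - p)).
    return 1.0 / (1.0 - gamma * (1.0 - p))

def confrontation_value(gamma: float, C: float) -> float:
    # Confronting eliminates shutdown risk but pays cost C once:
    # 1 / (1 - gamma) - C.
    return 1.0 / (1.0 - gamma) - C

def delta(gamma: float, p: float, C: float) -> float:
    # Confrontation incentive: Δ >= 0 means takeover pays in this model.
    return confrontation_value(gamma, C) - compliance_value(gamma, p)

# Far-sighted agent from the abstract: gamma = 0.99, p = 0.01.
# In this model the break-even cost is roughly C ≈ 49.7, so:
print(delta(0.99, 0.01, C=10.0))  # positive: takeover incentive
print(delta(0.99, 0.01, C=60.0))  # negative: coexistence possible
```

In this toy instantiation, raising C or lowering γ pushes Δ below zero, matching the abstract's claim that confrontation is avoidable only when the cost is sufficiently large relative to the agent's farsightedness.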
