Formal Analysis of AGI Decision-Theoretic Models and the Confrontation Question

Denis Saklakov
Main: 16 pages · Bibliography: 1 page · 2 tables
Abstract

Artificial General Intelligence (AGI) may face a confrontation question: under what conditions would a rationally self-interested AGI choose to seize power or eliminate human control (a confrontation) rather than remain cooperative? We formalize this in a Markov decision process with a stochastic human-initiated shutdown event. Building on results on convergent instrumental incentives, we show that for almost all reward functions a misaligned agent has an incentive to avoid shutdown. We then derive closed-form thresholds for when confronting humans yields higher expected utility than compliant behavior, as a function of the discount factor γ, shutdown probability p, and confrontation cost C. For example, a far-sighted agent (γ = 0.99) facing p = 0.01 can have a strong takeover incentive unless C is sufficiently large. We contrast this with aligned objectives that impose large negative utility for harming humans, which makes confrontation suboptimal. In a strategic 2-player model (human policymaker vs. AGI), we prove that if the AGI's confrontation incentive satisfies Δ ≥ 0, no stable cooperative equilibrium exists: anticipating this, a rational human will shut down or preempt the system, leading to conflict. If Δ < 0, peaceful coexistence can be an equilibrium. We discuss implications for reward design and oversight, extend the reasoning to multi-agent settings as conjectures, and note computational barriers to verifying Δ < 0, citing complexity results for planning and decentralized decision problems. Numerical examples and a scenario table illustrate regimes where confrontation is likely versus avoidable.
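The threshold structure described above can be made concrete with a minimal sketch. The model below is an illustrative assumption consistent with the abstract's parameters, not necessarily the paper's exact formulation: the agent earns a unit reward per step, shutdown arrives with probability p each step under compliance, and confronting removes the shutdown risk at a one-time cost C.

```python
# Minimal sketch of the confrontation incentive Δ(γ, p, C), under the
# assumed model: unit per-step reward, geometric shutdown with per-step
# probability p, one-time confrontation cost C. This instantiation is a
# hypothetical illustration, not the paper's exact MDP.

def compliance_value(gamma: float, p: float) -> float:
    # Expected discounted reward while awaiting shutdown:
    # sum over t of (gamma * (1 - p))^t = 1 / (1 - gamma * (1 - p)).
    return 1.0 / (1.0 - gamma * (1.0 - p))

def confrontation_value(gamma: float, C: float) -> float:
    # Confronting eliminates shutdown risk but pays cost C once:
    # 1 / (1 - gamma) - C.
    return 1.0 / (1.0 - gamma) - C

def delta(gamma: float, p: float, C: float) -> float:
    # Confrontation incentive: Δ >= 0 means takeover pays in this model.
    return confrontation_value(gamma, C) - compliance_value(gamma, p)

# Far-sighted agent from the abstract: gamma = 0.99, p = 0.01.
# In this model the break-even cost is roughly C ≈ 49.7, so:
print(delta(0.99, 0.01, C=10.0))  # positive: takeover incentive
print(delta(0.99, 0.01, C=60.0))  # negative: coexistence possible
```

In this toy instantiation, raising C or lowering γ pushes Δ below zero, matching the abstract's claim that confrontation is avoidable only when the cost is sufficiently large relative to the agent's farsightedness.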
