Murphy's Laws of AI Alignment: Why the Gap Always Wins
- ALM

Main: 7 pages
Bibliography: 2 pages
Appendix: 8 pages
Abstract
We prove a formal impossibility result for reinforcement learning from human feedback (RLHF). In misspecified environments with bounded query budgets, any RLHF-style learner suffers an irreducible performance gap of Ω(γ) unless it has access to a calibration oracle. We give tight lower bounds via an information-theoretic proof and show that a minimal calibration oracle suffices to eliminate the gap. Small-scale empirical illustrations and a catalogue of alignment regularities (Murphy's Laws) indicate that many observed alignment failures are consistent with this structural mechanism. Our results position Murphy's Gap as both a diagnostic limit of RLHF and a guide for future work on calibration and causal preference checks.
