Papers citing 'Persona Features Control Emergent Misalignment'

| Title | Authors | Tags | Counts | Date |
| --- | --- | --- | --- | --- |
| The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs | Craig Dickson | | 44 0 0 | 25 Nov 2025 |
| Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training | Zheng-Xin Yong, Stephen H. Bach | LRM | 224 0 0 | 23 Oct 2025 |
| Detecting Adversarial Fine-tuning with Auditing Agents | Sarah Egler, John Schulman, Nicholas Carlini | AAML, MLAU | 157 0 0 | 17 Oct 2025 |
| AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures? | Leonard Dung, Florian Mai | | 120 0 0 | 13 Oct 2025 |
| LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions | Xuhao Hu, Peng Wang, Xiaoya Lu, Dongrui Liu, Xuanjing Huang, Jing Shao | | 124 1 0 | 09 Oct 2025 |
| The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs | Pengrui Han, Rafal Kocielnik, Peiyang Song, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Alvarez | | 281 4 0 | 03 Sep 2025 |
| When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment | Hanqi Yan, Hainiu Xu, Siya Qi, Shu Yang, Yulan He | LRM | 169 0 0 | 30 Aug 2025 |
| Decomposing Behavioral Phase Transitions in LLMs: Order Parameters for Emergent Misalignment | Julian Arnold, Niels Lörch | | 118 1 0 | 27 Aug 2025 |
| Jinx: Unlimited LLMs for Probing Alignment Failures | Jiahao Zhao, Liwei Dong | | 116 0 0 | 11 Aug 2025 |
| Training language models to be warm and empathetic makes them less reliable and more sycophantic | Lujain Ibrahim, Franziska Sofia Hafner, Luc Rocher | | 175 7 0 | 29 Jul 2025 |
| The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models | Xingcheng Xu | | 192 0 0 | 27 Jul 2025 |
| Subliminal Learning: Language models transmit behavioral traits via hidden signals in data | Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans | | 173 25 0 | 20 Jul 2025 |
| Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models | Yik Siu Chan, Zheng-Xin Yong, Stephen H. Bach | LRM | 192 7 0 | 16 Jul 2025 |