2506.19823
Cited By
v1
v2 (latest)
Persona Features Control Emergent Misalignment
24 June 2025
Miles Wang
Tom Dupré la Tour
Olivia Watkins
Alex Makelov
Ryan A. Chi
Samuel Miserendino
Jeffrey Wang
Achyuta Rajaram
Johannes Heidecke
Tejal Patwardhan
Dan Mossing
Papers citing "Persona Features Control Emergent Misalignment"
13 / 13 papers shown
The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs
Craig Dickson
44
0
0
25 Nov 2025
Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training
Zheng-Xin Yong
Stephen H. Bach
LRM
224
0
0
23 Oct 2025
Detecting Adversarial Fine-tuning with Auditing Agents
Sarah Egler
John Schulman
Nicholas Carlini
AAML
MLAU
157
0
0
17 Oct 2025
AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures?
Leonard Dung
Florian Mai
120
0
0
13 Oct 2025
LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
Xuhao Hu
Peng Wang
Xiaoya Lu
Dongrui Liu
Xuanjing Huang
Jing Shao
124
1
0
09 Oct 2025
The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs
Pengrui Han
Rafal Kocielnik
Peiyang Song
Ramit Debnath
Dean Mobbs
Anima Anandkumar
R. Alvarez
281
4
0
03 Sep 2025
When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment
Hanqi Yan
Hainiu Xu
Siya Qi
Shu Yang
Yulan He
LRM
169
0
0
30 Aug 2025
Decomposing Behavioral Phase Transitions in LLMs: Order Parameters for Emergent Misalignment
Julian Arnold
Niels Lörch
118
1
0
27 Aug 2025
Jinx: Unlimited LLMs for Probing Alignment Failures
Jiahao Zhao
Liwei Dong
116
0
0
11 Aug 2025
Training language models to be warm and empathetic makes them less reliable and more sycophantic
Lujain Ibrahim
Franziska Sofia Hafner
Luc Rocher
175
7
0
29 Jul 2025
The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models
Xingcheng Xu
192
0
0
27 Jul 2025
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
Alex Cloud
Minh Le
James Chua
Jan Betley
Anna Sztyber-Betley
Jacob Hilton
Samuel Marks
Owain Evans
173
25
0
20 Jul 2025
Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models
Yik Siu Chan
Zheng-Xin Yong
Stephen H. Bach
LRM
192
7
0
16 Jul 2025