Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

24 February 2025

Papers citing "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs"

Title
Patterns and Mechanisms of Contrastive Activation Engineering
Yixiong Hao, Ayush Panda, Stepan Shabalin, Sheikh Abdur Raheem Ali
LLMSV · 44 1 0 · 06 May 2025

Jekyll-and-Hyde Tipping Point in an AI's Behavior
Neil F. Johnson, Frank Yingjie Huo
34 13 0 · 29 Apr 2025

IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery
Aniketh Garikaparthi, Manasi S. Patwardhan, L. Vig, Arman Cohan
VLM, LRM · 37 1 0 · 23 Apr 2025

Safety Pretraining: Toward the Next Generation of Safe AI
Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, Andy Zou, Zachary C. Lipton, J. Zico Kolter
36 1 0 · 23 Apr 2025

Beyond Misinformation: A Conceptual Framework for Studying AI Hallucinations in (Science) Communication
Anqi Shao
34 33 0 · 18 Apr 2025

Capturing AI's Attention: Physics of Repetition, Hallucination, Bias and Beyond
Frank Yingjie Huo, Neil F. Johnson
30 1 0 · 06 Apr 2025

Propaganda is all you need
Paul Kronlund-Drouault
33 1 0 · 13 Sep 2024