Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness

27 October 2024

Qi Zhang

Yisen Wang

Papers citing "Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness"

Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger, Boussad Addad, Katarzyna Kapusta
AAML
8 March 2025