arXiv:2406.09289
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
13 June 2024
Sarah Ball, Frauke Kreuter, Nina Rimsky
Papers citing "Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models"
11 / 11 papers shown
Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots
Erfan Shayegani, G M Shahariar, Sara Abdali, Lei Yu, Nael B. Abu-Ghazaleh, Yue Dong
AAML | 53 | 0 | 0 | 01 Apr 2025
Towards LLM Guardrails via Sparse Representation Steering
Zeqing He, Zhibo Wang, Huiyu Xu, Kui Ren
LLMSV | 49 | 1 | 0 | 21 Mar 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger, Boussad Addad, Katarzyna Kapusta
AAML | 63 | 0 | 0 | 08 Mar 2025
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, Chun Chen
49 | 3 | 0 | 17 Nov 2024
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Nathalie Maria Kirch, Constantin Weisser, Severin Field, Helen Yannakoudakis, Stephen Casper
29 | 1 | 0 | 02 Nov 2024
Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs
Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Z. Liu, Lirong Qiu, Xi Zhang
AAML | 16 | 1 | 0 | 18 Oct 2024
SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection
Han Shen, Pin-Yu Chen, Payel Das, Tianyi Chen
ALM | 26 | 11 | 0 | 09 Oct 2024
Programming Refusal with Conditional Activation Steering
Bruce W. Lee, Inkit Padhi, K. Ramamurthy, Erik Miehling, Pierre L. Dognin, Manish Nagireddy, Amit Dhurandhar
LLMSV | 91 | 13 | 0 | 06 Sep 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao
70 | 18 | 0 | 02 Jul 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea
64 | 95 | 0 | 03 Jan 2024
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving
ALM | 275 | 1,561 | 0 | 18 Sep 2019