How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
arXiv:2406.05644 · 9 June 2024
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Yongbin Li

Papers citing "How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States" (21 of 21 papers shown)

One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models
Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin
12 May 2025

Latte: Transfering LLMs' Latent-level Knowledge for Few-shot Tabular Learning
Ruxue Shi, Hengrui Gu, Hangting Ye, Yiwei Dai, Xu Shen, Xin Wang
LMTD · 08 May 2025

RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models
Bang An, Shiyue Zhang, Mark Dredze
25 Apr 2025

Feature-Aware Malicious Output Detection and Mitigation
Weilong Dong, Peiguang Li, Yu Tian, Xinyi Zeng, Fengdi Li, Sirui Wang
AAML · 12 Apr 2025

Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs
Wenzhuo Xu, Zhipeng Wei, Xiongtao Sun, Deyue Zhang, Dongdong Yang, Quanchen Zou, X. Zhang
AAML · 10 Mar 2025

Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger, Boussad Addad, Katarzyna Kapusta
AAML · 08 Mar 2025

Efficient Jailbreaking of Large Models by Freeze Training: Lower Layers Exhibit Greater Sensitivity to Harmful Content
Hongyuan Shen, Min Zheng, Jincheng Wang, Yang Zhao
28 Feb 2025

SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention
Jiaqi Wu, Chen Chen, Chunyan Hou, Xiaojie Yuan
AAML · 24 Feb 2025

LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models
Miao Yu, Junfeng Fang, Yingjie Zhou, Xing Fan, Kun Wang, Shirui Pan, Qingsong Wen
AAML · 03 Jan 2025

Quantized Delta Weight Is Safety Keeper
Yule Liu, Zhen Sun, Xinlei He, Xinyi Huang
29 Nov 2024

JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, Chun Chen
17 Nov 2024

Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs
Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Z. Liu, Lirong Qiu, Xi Zhang
AAML · 18 Oct 2024

On the Role of Attention Heads in Large Language Model Safety
Z. Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, Yongbin Li
17 Oct 2024

The Imperative of Conversation Analysis in the Era of LLMs: A Survey of Tasks, Techniques, and Trends
Xinghua Zhang, Haiyang Yu, Yongbin Li, Minzheng Wang, Longze Chen, Fei Huang
21 Sep 2024

EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models
Chongwen Zhao, Zhihao Dou, Kaizhu Huang
AAML · 21 Aug 2024

Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Carrier Articles
Zhilong Wang, Haizhou Wang, Nanqing Luo, Lan Zhang, Xiaoyan Sun, Yebo Cao, Peng Liu
20 Aug 2024

Course-Correction: Safety Alignment Using Synthetic Preferences
Rongwu Xu, Yishuo Cai, Z. Zhou, Renjie Gu, Haiqin Weng, Yan Liu, Tianwei Zhang, Wei Xu, Han Qiu
23 Jul 2024

Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Meng Wang, Yunzhi Yao, Ziwen Xu, Shuofei Qiao, Shumin Deng, ..., Yong-jia Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
22 Jul 2024

Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Jiayang Song, Yuheng Huang, Zhehua Zhou, Lei Ma
10 Jul 2024

Privacy in Large Language Models: Attacks, Defenses and Future Directions
Haoran Li, Yulin Chen, Jinglong Luo, Yan Kang, Xiaojin Zhang, Qi Hu, Chunkit Chan, Yangqiu Song
PILM · 16 Oct 2023

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM · 04 Mar 2022