On the Role of Attention Heads in Large Language Model Safety
International Conference on Learning Representations (ICLR), 2025
17 October 2024
Zhenhong Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Kun Wang
Yang Liu
Cunchun Li
Yongbin Li
arXiv: 2410.13708
Papers citing "On the Role of Attention Heads in Large Language Model Safety"
50 of 89 papers shown
Matching Ranks Over Probability Yields Truly Deep Safety Alignment
Jason Vega
Gagandeep Singh
AAML
80
0
0
05 Dec 2025
Minimal neuron ablation triggers catastrophic collapse in the language core of Large Vision-Language Models
Cen Lu
Yung-Chen Tang
Andrea Cavallaro
55
0
0
30 Nov 2025
WebRec: Enhancing LLM-based Recommendations with Attention-guided RAG from Web
Zihuai Zhao
Yujuan Ding
Wenqi Fan
Qing Li
3DV
280
0
0
18 Nov 2025
Investigating CoT Monitorability in Large Reasoning Models
Shu Yang
Junchao Wu
Xilin Gou
X. Wu
Yang Li
Ninhao Liu
Di Wang
LRM
222
1
0
11 Nov 2025
Focusing on Language: Revealing and Exploiting Language Attention Heads in Multilingual Large Language Models
Xin Liu
Qiyang Song
Qihang Zhou
Haichao Du
Shaowen Xu
Wenbo Jiang
Weijuan Zhang
X. Jia
LRM
130
0
0
10 Nov 2025
Chain-of-Thought Hijacking
Jianli Zhao
Tingchen Fu
Rylan Schaeffer
Mrinank Sharma
Fazl Barez
LRM
182
3
0
30 Oct 2025
CircuitSeer: Mining High-Quality Data by Probing Mathematical Reasoning Circuits in LLMs
Shaobo Wang
Yongliang Miao
Yuancheng Liu
Qianli Ma
Ning Liao
Linfeng Zhang
LRM
172
1
0
21 Oct 2025
VisuoAlign: Safety Alignment of LVLMs with Multimodal Tree Search
MingSheng Li
Guangze Zhao
Sichen Liu
136
0
0
10 Oct 2025
Validation of Various Normalization Methods for Brain Tumor Segmentation: Can Federated Learning Overcome This Heterogeneity?
Jan Fiszer
Dominika Ciupek
Maciej Malawski
FedML
219
2
0
08 Oct 2025
Towards Reliable and Practical LLM Security Evaluations via Bayesian Modelling
Mary Llewellyn
Annie Gray
Josh Collyer
Michael Harries
125
0
0
07 Oct 2025
ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
Yein Park
Jungwoo Park
Jaewoo Kang
179
0
0
30 Sep 2025
Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models
Miao Yu
Zhenhong Zhou
Moayad Aloqaily
Kun Wang
Biwei Huang
S. Wang
Yueming Jin
Qingsong Wen
AAML
LLMSV
279
0
0
26 Sep 2025
Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment through Latent Acoustic Pattern Triggers
Guanbin Li
Miao Yu
Kaiwen Luo
Yibo Zhang
Lilan Peng
...
Yuanhe Zhang
Xikang Yang
Zhenhong Zhou
Kun Wang
Yang Liu
AAML
222
3
0
04 Aug 2025
Soft Injection of Task Embeddings Outperforms Prompt-Based In-Context Learning
J. Park
Wonjong Rhee
268
0
0
28 Jul 2025
LLMs Encode Harmfulness and Refusal Separately
Jiachen Zhao
Jing-ling Huang
Zhengxuan Wu
David Bau
Weiyan Shi
LLMSV
453
10
0
16 Jul 2025
Large Language Models Encode Semantics and Alignment in Linearly Separable Representations
Baturay Saglam
Paul Kassianik
Blaine Nelson
Sajana Weerawardhena
Yaron Singer
Amin Karbasi
197
3
0
13 Jul 2025
SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification
ZhengLin Lai
MengYao Liao
Bingzhe Wu
Dong Xu
Zebin Zhao
Zhihang Yuan
Chao Fan
Jianqiang Li
MoE
216
6
0
20 Jun 2025
AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions
Aishan Liu
Zonghao Ying
L. Wang
Junjie Mu
Jinyang Guo
Yuqing Ma
Yaning Tan
Mingchuan Zhang
Xianglong Liu
409
13
0
17 Jun 2025
Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025
Zonghao Ying
Siyang Wu
Run Hao
Peng Ying
Shixuan Sun
...
Xianglong Liu
Dawn Song
Yaoyao Liu
Juil Sock
Dacheng Tao
326
10
0
14 Jun 2025
Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety
Seongmin Lee
Aeree Cho
Grace C. Kim
ShengYun Peng
Mansi Phute
Duen Horng Chau
LM&MA
AI4CE
356
5
0
05 Jun 2025
ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models
Hao Chen
Haoze Li
Zhiqing Xiao
Lirong Gao
Qi Zhang
Xiaomeng Hu
Ningtao Wang
Xing Fu
Junbo Zhao
619
0
0
24 May 2025
Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yue Li
Xin Yi
Dongsheng Shi
Gerard de Melo
Xiaoling Wang
Linlin Wang
358
1
0
22 May 2025
ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
Chung-En Sun
Ge Yan
Tsui-Wei Weng
KELM
LRM
524
13
0
27 Mar 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger
Boussad Addad
Katarzyna Kapusta
AAML
445
3
0
08 Mar 2025
Understanding and Rectifying Safety Perception Distortion in VLMs
Xiaohan Zou
Jian Kang
George Kesidis
Lu Lin
1.0K
5
0
18 Feb 2025
Reinforced Lifelong Editing for Language Models
Zherui Li
Houcheng Jiang
Hao Chen
Baolong Bi
Zhenhong Zhou
Fei Sun
Cunchun Li
Xinze Wang
KELM
637
23
0
09 Feb 2025
Attention Heads of Large Language Models: A Survey
Patterns, 2024
Zifan Zheng
Yezhaohui Wang
Yuxin Huang
Chenyang Xi
Junchi Yan
Bo Tang
Feiyu Xiong
Zhiyu Li
LRM
301
70
0
05 Sep 2024
Extend Model Merging from Fine-Tuned to Pre-Trained Large Language Models via Weight Disentanglement
Le Yu
Bowen Yu
Haiyang Yu
Fei Huang
Yongbin Li
MoMe
281
11
0
06 Aug 2024
Course-Correction: Safety Alignment Using Synthetic Preferences
Rongwu Xu
Yishuo Cai
Zhenhong Zhou
Renjie Gu
Haiqin Weng
Yan Liu
Tianwei Zhang
Wei Xu
Han Qiu
217
13
0
23 Jul 2024
Qwen2 Technical Report
An Yang
Baosong Yang
Binyuan Hui
Jian Xu
Bowen Yu
...
Yuqiong Liu
Zeyu Cui
Zhenru Zhang
Zhifang Guo
Zhi-Wei Fan
OSLM
VLM
MU
684
1,822
0
15 Jul 2024
Transformer Layers as Painters
Qi Sun
Marc Pickett
Aakash Kumar Nain
Llion Jones
AI4CE
614
42
0
12 Jul 2024
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
Jianhui Chen
Xiaozhi Wang
Zijun Yao
Yushi Bai
Lei Hou
Juanzi Li
LLMSV
KELM
379
26
0
20 Jun 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
Oscar Obeso
Aaquib Syed
Daniel Paleka
Nina Panickssery
Wes Gurnee
Neel Nanda
395
471
0
17 Jun 2024
How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Zhenhong Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Yongbin Li
426
82
0
09 Jun 2024
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Yang Liu
Tianyu Pang
Chao Du
Yihao Huang
Jindong Gu
Yang Liu
Simeng Qin
Min Lin
AAML
371
87
0
31 May 2024
Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
Wei Zhao
Zhe Li
Yige Li
Ye Zhang
Junfeng Sun
KELM
AAML
264
64
0
28 May 2024
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks
Chak Tou Leong
Yi Cheng
Kaishuai Xu
Jian Wang
Hanlin Wang
Wenjie Li
AAML
353
31
0
25 May 2024
Retrieval Head Mechanistically Explains Long-Context Factuality
Wenhao Wu
Yizhong Wang
Guangxuan Xiao
Hao-Chun Peng
Yao Fu
LRM
225
151
0
24 Apr 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
428
332
0
22 Apr 2024
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
Zeyi Liao
Huan Sun
AAML
326
153
0
11 Apr 2024
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Patrick Chao
Edoardo Debenedetti
Avi Schwarzschild
Maksym Andriushchenko
Francesco Croce
...
Nicolas Flammarion
George J. Pappas
F. Tramèr
Hamed Hassani
Eric Wong
ALM
ELM
AAML
495
328
0
28 Mar 2024
Knowledge Conflicts for LLMs: A Survey
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Rongwu Xu
Zehan Qi
Zhijiang Guo
Cunxiang Wang
Hongru Wang
Yue Zhang
Wei Xu
1.3K
212
0
13 Mar 2024
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei
Kaixuan Huang
Yangsibo Huang
Tinghao Xie
Xiangyu Qi
Mengzhou Xia
Prateek Mittal
Mengdi Wang
Peter Henderson
AAML
354
184
0
07 Feb 2024
On Prompt-Driven Safeguarding for Large Language Models
Chujie Zheng
Fan Yin
Hao Zhou
Fandong Meng
Jie Zhou
Kai-Wei Chang
Shiyu Huang
Nanyun Peng
AAML
516
106
0
31 Jan 2024
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yi Zeng
Hongpeng Lin
Jingwen Zhang
Diyi Yang
Ruoxi Jia
Weiyan Shi
376
527
0
12 Jan 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
International Conference on Machine Learning (ICML), 2024
Andrew Lee
Xiaoyan Bai
Itamar Pres
Martin Wattenberg
Jonathan K. Kummerfeld
Amélie Reymond
364
165
0
03 Jan 2024
Successor Heads: Recurring, Interpretable Attention Heads In The Wild
International Conference on Learning Representations (ICLR), 2024
Rhys Gould
Euan Ong
George Ogden
Arthur Conmy
LRM
283
68
0
14 Dec 2023
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
Bill Yuchen Lin
Abhilasha Ravichander
Ximing Lu
Nouha Dziri
Melanie Sclar
Khyathi Chandu
Chandra Bhagavatula
Yejin Choi
257
276
0
04 Dec 2023
Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching
James Campbell
Richard Ren
Phillip Guo
HILM
231
25
0
25 Nov 2023
Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch
International Conference on Machine Learning (ICML), 2024
Le Yu
Bowen Yu
Haiyang Yu
Fei Huang
Yongbin Li
MoMe
582
534
0
06 Nov 2023