On the Role of Attention Heads in Large Language Model Safety (arXiv:2410.13708)

International Conference on Learning Representations (ICLR), 2024
17 October 2024
Zhenhong Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Kun Wang
Yang Liu
Cunchun Li
Yongbin Li

Papers citing "On the Role of Attention Heads in Large Language Model Safety"

50 of 89 citing papers shown (page 1 of 2)
Matching Ranks Over Probability Yields Truly Deep Safety Alignment
Jason Vega
Gagandeep Singh
AAML
80
0
0
05 Dec 2025
Minimal neuron ablation triggers catastrophic collapse in the language core of Large Vision-Language Models
Cen Lu
Yung-Chen Tang
Andrea Cavallaro
55
0
0
30 Nov 2025
WebRec: Enhancing LLM-based Recommendations with Attention-guided RAG from Web
Zihuai Zhao
Yujuan Ding
Wenqi Fan
Qing Li
3DV
280
0
0
18 Nov 2025
Investigating CoT Monitorability in Large Reasoning Models
Shu Yang
Junchao Wu
Xilin Gou
X. Wu
Yang Li
Ninghao Liu
Di Wang
LRM
222
1
0
11 Nov 2025
Focusing on Language: Revealing and Exploiting Language Attention Heads in Multilingual Large Language Models
Xin Liu
Qiyang Song
Qihang Zhou
Haichao Du
Shaowen Xu
Wenbo Jiang
Weijuan Zhang
X. Jia
LRM
130
0
0
10 Nov 2025
Chain-of-Thought Hijacking
Jianli Zhao
Tingchen Fu
Rylan Schaeffer
Mrinank Sharma
Fazl Barez
LRM
182
3
0
30 Oct 2025
CircuitSeer: Mining High-Quality Data by Probing Mathematical Reasoning Circuits in LLMs
Shaobo Wang
Yongliang Miao
Yuancheng Liu
Qianli Ma
Ning Liao
Linfeng Zhang
LRM
172
1
0
21 Oct 2025
VisuoAlign: Safety Alignment of LVLMs with Multimodal Tree Search
MingSheng Li
Guangze Zhao
Sichen Liu
136
0
0
10 Oct 2025
Validation of Various Normalization Methods for Brain Tumor Segmentation: Can Federated Learning Overcome This Heterogeneity?
Jan Fiszer
Dominika Ciupek
Maciej Malawski
FedML
219
2
0
08 Oct 2025
Towards Reliable and Practical LLM Security Evaluations via Bayesian Modelling
Mary Llewellyn
Annie Gray
Josh Collyer
Michael Harries
125
0
0
07 Oct 2025
ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
Yein Park
Jungwoo Park
Jaewoo Kang
179
0
0
30 Sep 2025
Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models
Miao Yu
Zhenhong Zhou
Moayad Aloqaily
Kun Wang
Biwei Huang
S. Wang
Yueming Jin
Qingsong Wen
AAML, LLMSV
279
0
0
26 Sep 2025
Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment through Latent Acoustic Pattern Triggers
Guanbin Li
Miao Yu
Kaiwen Luo
Yibo Zhang
Lilan Peng
...
Yuanhe Zhang
Xikang Yang
Zhenhong Zhou
Kun Wang
Yang Liu
AAML
222
3
0
04 Aug 2025
Soft Injection of Task Embeddings Outperforms Prompt-Based In-Context Learning
J. Park
Wonjong Rhee
268
0
0
28 Jul 2025
LLMs Encode Harmfulness and Refusal Separately
Jiachen Zhao
Jing-ling Huang
Zhengxuan Wu
David Bau
Weiyan Shi
LLMSV
453
10
0
16 Jul 2025
Large Language Models Encode Semantics and Alignment in Linearly Separable Representations
Baturay Saglam
Paul Kassianik
Blaine Nelson
Sajana Weerawardhena
Yaron Singer
Amin Karbasi
197
3
0
13 Jul 2025
SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification
ZhengLin Lai
MengYao Liao
Bingzhe Wu
Dong Xu
Zebin Zhao
Zhihang Yuan
Chao Fan
Jianqiang Li
MoE
216
6
0
20 Jun 2025
AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions
Aishan Liu
Zonghao Ying
L. Wang
Junjie Mu
Jinyang Guo
Yuqing Ma
Yaning Tan
Mingchuan Zhang
Xianglong Liu
409
13
0
17 Jun 2025
Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025
Zonghao Ying
Siyang Wu
Run Hao
Peng Ying
Shixuan Sun
...
Xianglong Liu
Dawn Song
Yaoyao Liu
Juil Sock
Dacheng Tao
326
10
0
14 Jun 2025
Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety
Seongmin Lee
Aeree Cho
Grace C. Kim
ShengYun Peng
Mansi Phute
Duen Horng Chau
LM&MA, AI4CE
356
5
0
05 Jun 2025
ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models
Hao Chen
Haoze Li
Zhiqing Xiao
Lirong Gao
Qi Zhang
Xiaomeng Hu
Ningtao Wang
Xing Fu
Junbo Zhao
619
0
0
24 May 2025
Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yue Li
Xin Yi
Dongsheng Shi
Gerard de Melo
Xiaoling Wang
Linlin Wang
358
1
0
22 May 2025
ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
Chung-En Sun
Ge Yan
Tsui-Wei Weng
KELM, LRM
524
13
0
27 Mar 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger
Boussad Addad
Katarzyna Kapusta
AAML
445
3
0
08 Mar 2025
Understanding and Rectifying Safety Perception Distortion in VLMs
Xiaohan Zou
Jian Kang
George Kesidis
Lu Lin
1.0K
5
0
18 Feb 2025
Reinforced Lifelong Editing for Language Models
Zherui Li
Houcheng Jiang
Hao Chen
Baolong Bi
Zhenhong Zhou
Fei Sun
Cunchun Li
Xinze Wang
KELM
637
23
0
09 Feb 2025
Attention Heads of Large Language Models: A Survey
Patterns, 2024
Zifan Zheng
Yezhaohui Wang
Yuxin Huang
Chenyang Xi
Junchi Yan
Bo Tang
Feiyu Xiong
Zhiyu Li
LRM
301
70
0
05 Sep 2024
Extend Model Merging from Fine-Tuned to Pre-Trained Large Language Models via Weight Disentanglement
Le Yu
Bowen Yu
Haiyang Yu
Fei Huang
Yongbin Li
MoMe
281
11
0
06 Aug 2024
Course-Correction: Safety Alignment Using Synthetic Preferences
Rongwu Xu
Yishuo Cai
Zhenhong Zhou
Renjie Gu
Haiqin Weng
Yan Liu
Tianwei Zhang
Wei Xu
Han Qiu
217
13
0
23 Jul 2024
Qwen2 Technical Report
An Yang
Baosong Yang
Binyuan Hui
Jian Xu
Bowen Yu
...
Yuqiong Liu
Zeyu Cui
Zhenru Zhang
Zhifang Guo
Zhi-Wei Fan
OSLM, VLM, MU
684
1,822
0
15 Jul 2024
Transformer Layers as Painters
Qi Sun
Marc Pickett
Aakash Kumar Nain
Llion Jones
AI4CE
614
42
0
12 Jul 2024
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
Jianhui Chen
Xiaozhi Wang
Zijun Yao
Yushi Bai
Lei Hou
Juanzi Li
LLMSV, KELM
379
26
0
20 Jun 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
Oscar Obeso
Aaquib Syed
Daniel Paleka
Nina Panickssery
Wes Gurnee
Neel Nanda
395
471
0
17 Jun 2024
How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Zhenhong Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Yongbin Li
426
82
0
09 Jun 2024
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Yang Liu
Tianyu Pang
Chao Du
Yihao Huang
Jindong Gu
Yang Liu
Simeng Qin
Min Lin
AAML
371
87
0
31 May 2024
Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
Wei Zhao
Zhe Li
Yige Li
Ye Zhang
Junfeng Sun
KELM, AAML
264
64
0
28 May 2024
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks
Chak Tou Leong
Yi Cheng
Kaishuai Xu
Jian Wang
Hanlin Wang
Wenjie Li
AAML
353
31
0
25 May 2024
Retrieval Head Mechanistically Explains Long-Context Factuality
Wenhao Wu
Yizhong Wang
Guangxuan Xiao
Hao-Chun Peng
Yao Fu
LRM
225
151
0
24 Apr 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
428
332
0
22 Apr 2024
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
Zeyi Liao
Huan Sun
AAML
326
153
0
11 Apr 2024
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Patrick Chao
Edoardo Debenedetti
Avi Schwarzschild
Maksym Andriushchenko
Francesco Croce
...
Nicolas Flammarion
George J. Pappas
F. Tramèr
Hamed Hassani
Eric Wong
ALM, ELM, AAML
495
328
0
28 Mar 2024
Knowledge Conflicts for LLMs: A Survey
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Rongwu Xu
Zehan Qi
Zhijiang Guo
Cunxiang Wang
Hongru Wang
Yue Zhang
Wei Xu
1.3K
212
0
13 Mar 2024
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei
Kaixuan Huang
Yangsibo Huang
Tinghao Xie
Xiangyu Qi
Mengzhou Xia
Prateek Mittal
Mengdi Wang
Peter Henderson
AAML
354
184
0
07 Feb 2024
On Prompt-Driven Safeguarding for Large Language Models
Chujie Zheng
Fan Yin
Hao Zhou
Fandong Meng
Jie Zhou
Kai-Wei Chang
Shiyu Huang
Nanyun Peng
AAML
516
106
0
31 Jan 2024
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yi Zeng
Hongpeng Lin
Jingwen Zhang
Diyi Yang
Ruoxi Jia
Weiyan Shi
376
527
0
12 Jan 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
International Conference on Machine Learning (ICML), 2024
Andrew Lee
Xiaoyan Bai
Itamar Pres
Martin Wattenberg
Jonathan K. Kummerfeld
Amélie Reymond
364
165
0
03 Jan 2024
Successor Heads: Recurring, Interpretable Attention Heads In The Wild
International Conference on Learning Representations (ICLR), 2023
Rhys Gould
Euan Ong
George Ogden
Arthur Conmy
LRM
283
68
0
14 Dec 2023
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
Bill Yuchen Lin
Abhilasha Ravichander
Ximing Lu
Nouha Dziri
Melanie Sclar
Khyathi Chandu
Chandra Bhagavatula
Yejin Choi
257
276
0
04 Dec 2023
Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching
James Campbell
Richard Ren
Phillip Guo
HILM
231
25
0
25 Nov 2023
Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch
International Conference on Machine Learning (ICML), 2023
Le Yu
Bowen Yu
Haiyang Yu
Fei Huang
Yongbin Li
MoMe
582
534
0
06 Nov 2023