Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

28 May 2024
Wei Zhao
Zhe Li
Yige Li
Ye Zhang
Junfeng Sun
    KELM, AAML
ArXiv (abs) · PDF · HTML · GitHub (18★)

Papers citing "Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing"

22 / 22 papers shown
SoK: a Comprehensive Causality Analysis Framework for Large Language Model Security
Wei Zhao
Zhe Li
Jun Sun
AAML
04 Dec 2025
Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security
Wei Zhao
Zhe Li
Yige Li
Jun Sun
AAML
20 Nov 2025
SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models
Hanbin Hong
Shuya Feng
Nima Naderloui
Shenao Yan
Jingyu Zhang
Biying Liu
Ali Arastehfard
Heqing Huang
Yuan Hong
AAML
17 Oct 2025
NeuroStrike: Neuron-Level Attacks on Aligned LLMs
Lichao Wu
Sasha Behrouzi
Mohamadreza Rostami
Maximilian Thang
S. Picek
A. Sadeghi
AAML, MoMe, LLMSV
15 Sep 2025
Unraveling LLM Jailbreaks Through Safety Knowledge Neurons
Chongwen Zhao
Kaizhu Huang
AAML, KELM
01 Sep 2025
LeakSealer: A Semisupervised Defense for LLMs Against Prompt Injection and Leakage Attacks
Francesco Panebianco
Stefano Bonfanti
Francesco Trovò
Michele Carminati
AAML
01 Aug 2025
CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Xiaohu Li
Yunfeng Ning
Zepeng Bao
Mayi Xu
Jianhao Chen
T. Qian
AAML
08 Jul 2025
Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs
Hiroshi Matsuda
Chunpeng Ma
Masayuki Asahara
11 Jun 2025
Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety
Seongmin Lee
Aeree Cho
Grace C. Kim
ShengYun Peng
Mansi Phute
Duen Horng Chau
LM&MA, AI4CE
05 Jun 2025
SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs
Shaona Ghosh
Amrita Bhattacharjee
Yftah Ziser
Christopher Parisien
LLMSV
01 Jun 2025
Spectral Insights into Data-Oblivious Critical Layers in Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Xuyuan Liu
Lei Hsiung
Yaoqing Yang
Yujun Yan
AAML
31 May 2025
How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study
Zhexin Zhang
Xian Qi Loye
Victor Shea-Jay Huang
Junxiao Yang
Qi Zhu
...
Fei Mi
Lifeng Shang
Yingkang Wang
Hongning Wang
Shiyu Huang
ReLM, LRM
21 May 2025
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models
Haoran Gu
Handing Wang
Yi Mei
Mengjie Zhang
Yaochu Jin
12 May 2025
AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents
Haoyu Wang
Christopher M. Poskitt
Jun Sun
24 Mar 2025
SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention
Jiaqi Wu
Chen Chen
Chunyan Hou
Xiaojie Yuan
AAML
21 Feb 2025
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yi Wang
Fenghua Weng
Shangshang Yang
Zhan Qin
Minlie Huang
Wenjie Wang
KELM, AAML
17 Feb 2025
JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation
Shenyi Zhang
Yuchen Zhai
Keyan Guo
Hongxin Hu
Shengnan Guo
Zheng Fang
Lingchen Zhao
Chao Shen
Cong Wang
Qian Wang
AAML
11 Feb 2025
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Yang Ouyang
Hengrui Gu
Shuhang Lin
Qingfeng Lan
Jie Peng
B. Kailkhura
Tianlong Chen
Kaixiong Zhou
AAML
05 Jan 2025
On the Role of Attention Heads in Large Language Model Safety
International Conference on Learning Representations (ICLR), 2024
Zhenhong Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Kun Wang
Yang Liu
Cunchun Li
Yongbin Li
17 Oct 2024
Defending against Jailbreak through Early Exit Generation of Large Language Models
Chongwen Zhao
Zhihao Dou
Kaizhu Huang
AAML
21 Aug 2024
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma
Satyapriya Krishna
Sebastian Gehrmann
Madhavan Seshadri
Anu Pradhan
Tom Ault
Leslie Barrett
David Rabinowitz
John Doucette
Nhathai Phan
20 Jul 2024
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Xunguang Wang
Daoyuan Wu
Zhenlan Ji
Zongjie Li
Pingchuan Ma
Shuai Wang
Yingjiu Li
Yang Liu
Ning Liu
Juergen Rahmel
AAML
08 Jun 2024