Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

28 May 2024
Wei Zhao
Zhe Li
Yige Li
Ye Zhang
Junfeng Sun
    KELM, AAML
ArXiv (abs) · PDF · HTML · GitHub (18★)

Papers citing "Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing"

22 / 22 papers shown
SoK: a Comprehensive Causality Analysis Framework for Large Language Model Security
Wei Zhao
Zhe Li
Jun Sun
AAML
04 Dec 2025
Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security
Wei Zhao
Zhe Li
Yige Li
Jun Sun
AAML
20 Nov 2025
SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models
Hanbin Hong
Shuya Feng
Nima Naderloui
Shenao Yan
Jingyu Zhang
Biying Liu
Ali Arastehfard
Heqing Huang
Yuan Hong
AAML
17 Oct 2025
NeuroStrike: Neuron-Level Attacks on Aligned LLMs
Lichao Wu
Sasha Behrouzi
Mohamadreza Rostami
Maximilian Thang
S. Picek
A. Sadeghi
AAML, MoMe, LLMSV
15 Sep 2025
Unraveling LLM Jailbreaks Through Safety Knowledge Neurons
Chongwen Zhao
Kaizhu Huang
AAML, KELM
01 Sep 2025
LeakSealer: A Semisupervised Defense for LLMs Against Prompt Injection and Leakage Attacks
Francesco Panebianco
Stefano Bonfanti
Francesco Trovò
Michele Carminati
AAML
01 Aug 2025
CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Xiaohu Li
Yunfeng Ning
Zepeng Bao
Mayi Xu
Jianhao Chen
T. Qian
AAML
08 Jul 2025
Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs
Hiroshi Matsuda
Chunpeng Ma
Masayuki Asahara
11 Jun 2025
Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety
Seongmin Lee
Aeree Cho
Grace C. Kim
ShengYun Peng
Mansi Phute
Duen Horng Chau
LM&MA, AI4CE
05 Jun 2025
SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs
Shaona Ghosh
Amrita Bhattacharjee
Yftah Ziser
Christopher Parisien
LLMSV
01 Jun 2025
Spectral Insights into Data-Oblivious Critical Layers in Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Xuyuan Liu
Lei Hsiung
Yaoqing Yang
Yujun Yan
AAML
31 May 2025
How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study
Zhexin Zhang
Xian Qi Loye
Victor Shea-Jay Huang
Junxiao Yang
Qi Zhu
...
Fei Mi
Lifeng Shang
Yingkang Wang
Hongning Wang
Shiyu Huang
ReLM, LRM
21 May 2025
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models
Haoran Gu
Handing Wang
Yi Mei
Mengjie Zhang
Yaochu Jin
12 May 2025
AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents
Haoyu Wang
Christopher M. Poskitt
Jun Sun
24 Mar 2025
SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention
Jiaqi Wu
Chen Chen
Chunyan Hou
Xiaojie Yuan
AAML
21 Feb 2025
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yi Wang
Fenghua Weng
Shangshang Yang
Zhan Qin
Minlie Huang
Wenjie Wang
KELM, AAML
17 Feb 2025
JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation
Shenyi Zhang
Yuchen Zhai
Keyan Guo
Hongxin Hu
Shengnan Guo
Zheng Fang
Lingchen Zhao
Chao Shen
Cong Wang
Qian Wang
AAML
11 Feb 2025
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Yang Ouyang
Hengrui Gu
Shuhang Lin
Qingfeng Lan
Jie Peng
B. Kailkhura
Tianlong Chen
Kaixiong Zhou
AAML
05 Jan 2025
On the Role of Attention Heads in Large Language Model Safety
International Conference on Learning Representations (ICLR), 2024
Zhenhong Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Kun Wang
Yang Liu
Cunchun Li
Yongbin Li
17 Oct 2024
Defending against Jailbreak through Early Exit Generation of Large Language Models
Chongwen Zhao
Zhihao Dou
Kaizhu Huang
AAML
21 Aug 2024
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma
Satyapriya Krishna
Sebastian Gehrmann
Madhavan Seshadri
Anu Pradhan
Tom Ault
Leslie Barrett
David Rabinowitz
John Doucette
Nhathai Phan
20 Jul 2024
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Xunguang Wang
Daoyuan Wu
Zhenlan Ji
Zongjie Li
Pingchuan Ma
Shuai Wang
Yingjiu Li
Yang Liu
Ning Liu
Juergen Rahmel
AAML
08 Jun 2024