JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation

11 February 2025
Shenyi Zhang, Yuchen Zhai, Keyan Guo, Hongxin Hu, Shengnan Guo, Zheng Fang, Lingchen Zhao, Chao Shen, Cong Wang, Qian Wang
AAML

Papers citing "JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation"

19 papers

From static to adaptive: immune memory-based jailbreak detection for large language models
Jun Leng, Litian Zhang, Xi Zhang, Ruihan Hu, Zhuting Fang, Xi Zhang
AAML · 03 Dec 2025

Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion
Yu Cui, Yifei Liu, Hang Fu, Sicheng Pan, Haibin Zhang, Cong Zuo, Licheng Wang
24 Nov 2025

ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models
Siyang Cheng, Gaotian Liu, Rui Mei, Yilin Wang, Kejia Zhang, Kaishuo Wei, Yuqi Yu, Weiping Wen, Xiaojie Wu, Junhua Liu
17 Nov 2025

Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
Yunhao Chen, Xin Wang, Juncheng Li, Yixu Wang, Jie Li, Yan Teng, Yingchun Wang, Xingjun Ma
AAML · 16 Nov 2025

AlignTree: Efficient Defense Against LLM Jailbreak Attacks
Gil Goren, Shahar Katz, Lior Wolf
AAML · 15 Nov 2025

Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks
Md. Mehedi Hasan, Ziaur Rahman, Rafid Mostafiz, Md. Abir Hossain
AAML · 26 Oct 2025

Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization
Masahiro Kaneko, Zeerak Talat, Timothy Baldwin
AAML · 19 Oct 2025

Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs
Masahiro Kaneko, Timothy Baldwin
AAML · 19 Oct 2025

SASER: Stego attacks on open-source LLMs
Ming Tan, Wei Li, Hu Tao, Hailong Ma, Aodi Liu, Qian Chen, Zilong Wang
AAML · 12 Oct 2025

HFuzzer: Testing Large Language Models for Package Hallucinations via Phrase-based Fuzzing
Yukai Zhao, Menghan Wu, Xing Hu, Xin Xia
HILM · 28 Sep 2025

Dual-Space Smoothness for Robust and Balanced LLM Unlearning
Han Yan, Zheyuan Liu, Meng Jiang
MU · AAML · 27 Sep 2025

Evaluating the Robustness of Retrieval-Augmented Generation to Adversarial Evidence in the Health Domain
Shakiba Amirshahi, Amin Bigdeli, Charles L. A. Clarke, Amira Ghenai
AAML · 04 Sep 2025

Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs
Hiroshi Matsuda, Chunpeng Ma, Masayuki Asahara
11 Jun 2025

JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift
Julien Piet, Xiao Huang, Dennis Jacob, Annabella Chow, Maha Alrashed, Geng Zhao, Zhanhao Hu, Chawin Sitawarin, Basel Alomair, David Wagner
AAML · 28 Apr 2025

AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, ..., Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
LLMSV · AAML · 13 Apr 2025

Beyond Prompts: Space-Time Decoupling Control-Plane Jailbreaks in LLM Structured Output
Shuoming Zhang, Jiacheng Zhao, Ruiyuan Xu, ..., Yuan Wen, Chunwei Xia, Zheng Wang, Xiaobing Feng, Huimin Cui
AAML · 31 Mar 2025

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, ..., Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal
ALM · ELM · 20 Jun 2024

OR-Bench: An Over-Refusal Benchmark for Large Language Models
Justin Cui, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh
ALM · 31 May 2024

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks (International Conference on Learning Representations (ICLR), 2024)
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
AAML · 02 Apr 2024