HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
v2 (latest)

6 February 2024
Mantas Mazeika
Long Phan
Xuwang Yin
Andy Zou
Zifan Wang
Norman Mu
Elham Sakhaee
Nathaniel Li
Steven Basart
Bo Li
David A. Forsyth
Dan Hendrycks
    AAML
ArXiv (abs) · PDF · HTML · HuggingFace (6 upvotes) · GitHub (652★)

Papers citing "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal"

Showing 50 of 487 citing papers (page 4 of 10)
Many-Turn Jailbreaking
Xianjun Yang
Liqiang Xiao
Shiyang Li
Faisal Ladhak
Hyokun Yun
Linda R. Petzold
Yi Xu
William Wang
151
0
0
09 Aug 2025
SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation
Lai Jiang
Yuekang Li
Xiaohan Zhang
Youtao Ding
Li Pan
117
0
0
08 Aug 2025
Multi-Level Safety Continual Projection for Fine-Tuned Large Language Models without Retraining
Bing Han
Feifei Zhao
Dongcheng Zhao
Guobin Shen
Ping Wu
Yu Shi
Yi Zeng
217
1
0
08 Aug 2025
Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
Matteo Prandi
Vincenzo Suriani
Federico Pierucci
Marcello Galisai
Daniele Nardi
Piercosma Bisconti
ELM
114
0
0
07 Aug 2025
Building Effective Safety Guardrails in AI Education Tools
Hannah-Beth Clark
Laura Benton
Emma Searle
Margaux Dowland
Matthew Gregory
Will Gayne
John Roberts
16
2
0
07 Aug 2025
Automatic LLM Red Teaming
Roman Belaire
Arunesh Sinha
Pradeep Varakantham
LLMAG
193
0
0
06 Aug 2025
Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety
Zhenyu Pan
Xicheng Zhang
Y. Zhang
Jianshu Zhang
Haozheng Luo
...
Dennis Wu
Hong-Yu Chen
Philip S. Yu
Manling Li
Han Liu
AAML
222
3
0
05 Aug 2025
RegMean++: Enhancing Effectiveness and Generalization of Regression Mean for Model Merging
The-Hai Nguyen
Dang Huu-Tien
Takeshi Suzuki
Le-Minh Nguyen
MoMe
291
2
0
05 Aug 2025
Activation-Guided Local Editing for Jailbreaking Attacks
Jiecong Wang
Haoran Li
Hao Peng
Ziqian Zeng
Zihao Wang
Haohua Du
Zhengtao Yu
AAML
224
0
0
01 Aug 2025
Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct Technical Report
Sajana Weerawardhena
Paul Kassianik
Blaine Nelson
Baturay Saglam
Anu Vellore
...
Dhruv Kedia
Kojin Oshiba
Zhouran Yang
Yaron Singer
Amin Karbasi
ALM, ELM
190
6
0
01 Aug 2025
Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs
Ziqian Zhong
Aditi Raghunathan
208
3
0
31 Jul 2025
Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs
Xikang Yang
Biyu Zhou
Xuehai Tang
Jizhong Han
Songlin Hu
AAML
168
0
0
30 Jul 2025
UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases
Raj Vardhan Tomar
Preslav Nakov
Yuxia Wang
LRM
260
3
0
29 Jul 2025
Libra: Large Chinese-based Safeguard for AI Content
Ziyang Chen
Huimu Yu
Xing Wu
Dongqin Liu
Songlin Hu
AILaw
146
1
0
29 Jul 2025
Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition
Andy Zou
Maxwell Lin
Eliot Krzysztof Jones
Micha Nowak
Mateusz Dziemian
...
Nate Burnikell
Yarin Gal
Dan Hendrycks
J. Zico Kolter
Matt Fredrikson
LLMAG, AAML, ELM
159
7
0
28 Jul 2025
The Blessing and Curse of Dimensionality in Safety Alignment
R. Teo
Laziz U. Abdullaev
Tan M. Nguyen
241
5
0
27 Jul 2025
PrompTrend: Continuous Community-Driven Vulnerability Discovery and Assessment for Large Language Models
Tarek Gasmi
Ramzi Guesmi
Mootez Aloui
Jihene Bennaceur
205
0
0
25 Jul 2025
MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?
Muntasir Wahed
Xiaona Zhou
Kiet A. Nguyen
Tianjiao Yu
Nirav Diwan
Gang Wang
Dilek Hakkani-Tür
Ismini Lourentzou
AAML
171
1
0
25 Jul 2025
PurpCode: Reasoning for Safer Code Generation
Jiawei Liu
Nirav Diwan
Zhe Wang
Haoyu Zhai
Xiaona Zhou
...
Hadjer Benkraouda
Yuxiang Wei
Lingming Zhang
Ismini Lourentzou
Gang Wang
SILM, LRM, ELM
448
7
0
25 Jul 2025
Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
Hao Li
Lijun Li
Zhenghao Lu
Xianyi Wei
Rui Li
Jing Shao
Lei Sha
394
11
0
24 Jul 2025
Towards Unifying Quantitative Security Benchmarking for Multi Agent Systems
Gauri Sharma
Vidhi Kulkarni
Miles King
Ken Huang
144
0
0
23 Jul 2025
The Geometry of Harmfulness in LLMs through Subconcept Probing
McNair Shah
Saleena Angeline
Adhitya Rajendra Kumar
Naitik Chheda
Kevin Zhu
Sean O'Brien
Will Cai
LLMSV
239
3
0
23 Jul 2025
Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models
Yik Siu Chan
Zheng-Xin Yong
Stephen H. Bach
LRM
254
9
0
16 Jul 2025
ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning
Zhengyue Zhao
Yingzi Ma
S. Jha
Marco Pavone
P. McDaniel
Chaowei Xiao
LRM
207
2
0
14 Jul 2025
Large Language Models Encode Semantics and Alignment in Linearly Separable Representations
Baturay Saglam
Paul Kassianik
Blaine Nelson
Sajana Weerawardhena
Yaron Singer
Amin Karbasi
175
3
0
13 Jul 2025
Attention-Aware GNN-based Input Defense against Multi-Turn LLM Jailbreak
Zixuan Huang
Kecheng Huang
Lihao Yin
Bowei He
Huiling Zhen
Mingxuan Yuan
Zili Shao
AAML
381
0
0
09 Jul 2025
PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage
Krishna Kanth Nakka
Xue Jiang
Dmitrii Usynin
Xuebing Zhou
LLMSV
250
1
0
03 Jul 2025
Reasoning as an Adaptive Defense for Safety
Taeyoun Kim
Fahim Tajwar
Aditi Raghunathan
Aviral Kumar
LRM
176
9
0
01 Jul 2025
VERA: Variational Inference Framework for Jailbreaking Large Language Models
Anamika Lochab
Lu Yan
Patrick Pynadath
Xiangyu Zhang
Ruqi Zhang
AAML, VLM
377
1
0
27 Jun 2025
RedCoder: Automated Multi-Turn Red Teaming for Code LLMs
Wenjie Mo
Qin Liu
Xiaofei Wen
Dongwon Jung
Hadi Askari
Wenxuan Zhou
Zhe Zhao
Muhao Chen
LLMAG, AAML
177
3
1
25 Jun 2025
A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures
Dezhang Kong
Shi Lin
Zhenhua Xu
Z. J. Wang
Minghao Li
...
Ningyu Zhang
Chaochao Chen
Chunming Wu
Muhammad Khurram Khan
Meng Han
LLMAG
359
28
0
24 Jun 2025
GRAF: Multi-turn Jailbreaking via Global Refinement and Active Fabrication
Hua Tang
Lingyong Yan
Yukun Zhao
Shuaiqiang Wang
J. Huang
Dawei Yin
186
1
0
22 Jun 2025
From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers
Jingtong Su
Julia Kempe
Karen Ullrich
276
3
0
20 Jun 2025
Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models
Lei Jiang
Zixun Zhang
Zizhou Wang
Xiaobing Sun
Zhen Li
Liangli Zhen
Xiaohua Xu
AAML
234
2
0
20 Jun 2025
SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification
ZhengLin Lai
MengYao Liao
Bingzhe Wu
Dong Xu
Zebin Zhao
Zhihang Yuan
Chao Fan
Jianqiang Li
MoE
205
3
0
20 Jun 2025
Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
Kartik Sharma
Yiqiao Jin
Vineeth Rakesh
Yingtong Dou
Menghai Pan
Mahashweta Das
Srijan Kumar
AAML
238
0
0
18 Jun 2025
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
Thomas Kuntz
Agatha Duzan
Hao Zhao
Francesco Croce
Zico Kolter
Nicolas Flammarion
Maksym Andriushchenko
LLMAG, ELM
314
18
0
17 Jun 2025
Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding
Youze Wang
Zijun Chen
Ruoyu Chen
Shishen Gu
Yinpeng Dong
...
Jun Zhu
Meng Wang
Richang Hong
Wenbo Hu
365
0
0
14 Jun 2025
QGuard:Question-based Zero-shot Guard for Multi-modal LLM Safety
Taegyeong Lee
Jeonghwa Yoo
Hyoungseo Cho
Soo Yong Kim
Yunho Maeng
AAML
279
2
0
14 Jun 2025
Improving Large Language Model Safety with Contrastive Representation Learning
Samuel Simko
Mrinmaya Sachan
Bernhard Schölkopf
Zhijing Jin
AAML
377
3
0
13 Jun 2025
InfoFlood: Jailbreaking Large Language Models with Information Overload
Advait Yadav
Haibo Jin
Man Luo
Jun Zhuang
Haohan Wang
AAML
216
3
0
13 Jun 2025
How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?
Sohee Yang
Sang-Woo Lee
Nora Kassner
Daniela Gottesman
Sebastian Riedel
Mor Geva
LRM
383
4
0
12 Jun 2025
VerIF: Verification Engineering for Reinforcement Learning in Instruction Following
Hao Peng
Yunjia Qi
Xiaozhi Wang
Bin Xu
Lei Hou
Juanzi Li
OffRL
306
9
0
11 Jun 2025
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
Polina Kirichenko
Mark Ibrahim
Kamalika Chaudhuri
Samuel J. Bell
LRM
207
26
0
10 Jun 2025
AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin
Shuo Yang
Qihui Zhang
Yuyang Liu
Yue Huang
Xiaojun Jia
Kunpeng Ning
Jiayu Yao
Jigang Wang
Hailiang Dai
Yibing Song
275
10
0
10 Jun 2025
TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts
T. Krauß
Hamid Dashtbani
Alexandra Dmitrienko
157
6
0
09 Jun 2025
InverseScope: Scalable Activation Inversion for Interpreting Large Language Models
Yifan Luo
Zhennan Zhou
Bin Dong
177
0
0
09 Jun 2025
Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models
Mickel Liu
L. Jiang
Yancheng Liang
S. Du
Yejin Choi
Tim Althoff
Natasha Jaques
AAML, LRM
319
15
0
09 Jun 2025
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint
Leheng Sheng
Changshuo Shen
Weixiang Zhao
Junfeng Fang
Xiaohao Liu
Zhenkai Liang
Xiang Wang
An Zhang
Tat-Seng Chua
LLMSV
157
9
0
08 Jun 2025
Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values
Nell Watson
Ahmed Amer
Evan Harris
Preeti Ravindra
Shujun Zhang
233
1
0
08 Jun 2025