arXiv: 1908.07125
Universal Adversarial Triggers for Attacking and Analyzing NLP
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
20 August 2019
Eric Wallace
Shi Feng
Nikhil Kandpal
Matt Gardner
Sameer Singh
AAML
SILM
Papers citing "Universal Adversarial Triggers for Attacking and Analyzing NLP"
50 / 662 papers shown
Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification
Boyang Zhang
Yicong Tan
Yun Shen
Ahmed Salem
Michael Backes
Savvas Zannettou
Yang Zhang
LLMAG
AAML
280
55
0
30 Jul 2024
Scaling Trends in Language Model Robustness
Nikolaus Howe
Michal Zajac
I. R. McKenzie
Oskar Hollinsworth
Tom Tseng
Aaron David Tucker
Pierre-Luc Bacon
Adam Gleave
647
1
0
25 Jul 2024
Multimodal Unlearnable Examples: Protecting Data against Multimodal Contrastive Learning
Xinwei Liu
Yang Liu
Yuan Xun
Yaning Tan
Simeng Qin
284
13
0
23 Jul 2024
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma
Satyapriya Krishna
Sebastian Gehrmann
Madhavan Seshadri
Anu Pradhan
Tom Ault
Leslie Barrett
David Rabinowitz
John Doucette
Nhathai Phan
438
42
0
20 Jul 2024
Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context
Nilanjana Das
Edward Raff
Manas Gaur
AAML
327
7
0
19 Jul 2024
Robust Neural Information Retrieval: An Adversarial and Out-of-distribution Perspective
Yu-An Liu
Ruqing Zhang
Jiafeng Guo
Maarten de Rijke
Yixing Fan
Xueqi Cheng
400
20
0
09 Jul 2024
Raply: A profanity-mitigated rap generator
Omar Manil Bendali
Samir Ferroum
Ekaterina Kozachenko
Youssef Parviz
Hanna Shcharbakova
Anna Tokareva
Shemair Williams
121
0
0
09 Jul 2024
AI Safety in Generative AI Large Language Models: A Survey
Jaymari Chua
Yun Yvonna Li
Shiyi Yang
Chen Wang
Lina Yao
LM&MA
364
37
0
06 Jul 2024
On the Low-Rank Parametrization of Reward Models for Controlled Language Generation
S. Troshin
Vlad Niculae
Antske Fokkens
190
0
0
05 Jul 2024
Defense Against Syntactic Textual Backdoor Attacks with Token Substitution
Xinglin Li
Xianwen He
Yao Li
Minhao Cheng
200
1
0
04 Jul 2024
Securing Multi-turn Conversational Language Models Against Distributed Backdoor Triggers
Terry Tong
Lyne Tchapmi
Qin Liu
Muhao Chen
AAML
SILM
283
6
0
04 Jul 2024
Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models
Xavier Suau
Pieter Delobelle
Katherine Metcalf
Armand Joulin
N. Apostoloff
Luca Zappella
P. Rodríguez
MU
AAML
273
27
0
02 Jul 2024
Jailbreaking LLMs with Arabic Transliteration and Arabizi
Mansour Al Ghanim
Saleh Almohaimeed
Mengxin Zheng
Yan Solihin
Qian Lou
184
7
0
26 Jun 2024
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
Aakanksha
Arash Ahmadian
Beyza Ermis
Seraphina Goldfarb-Tarrant
Julia Kreutzer
Marzieh Fadaee
Sara Hooker
365
53
0
26 Jun 2024
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
Haibo Jin
Leyang Hu
Xinuo Li
Peiyan Zhang
Chonghan Chen
Jun Zhuang
Haohan Wang
PILM
423
61
0
26 Jun 2024
FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts
Caroline Brun
Vassilina Nikoulina
292
5
0
25 Jun 2024
Data Augmentation of Multi-turn Psychological Dialogue via Knowledge-driven Progressive Thought Prompting
Jiyue Jiang
Liheng Chen
Sheng Wang
Lingpeng Kong
Yu Li
Chuan Wu
244
0
0
24 Jun 2024
Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference
Anton Xue
Avishree Khare
Rajeev Alur
Surbhi Goel
Eric Wong
689
4
0
21 Jun 2024
Adversaries Can Misuse Combinations of Safe Models
Erik Jones
Anca Dragan
Jacob Steinhardt
256
18
0
20 Jun 2024
Generative AI Misuse: A Taxonomy of Tactics and Insights from Real-World Data
Nahema Marchal
Rachel Xu
Rasmi Elasmar
Iason Gabriel
Beth Goldberg
William S. Isaac
LLMAG
239
35
0
19 Jun 2024
Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens
Xikang Yang
Xuehai Tang
Fuqing Zhu
Jizhong Han
Songlin Hu
VLM
AAML
208
3
0
19 Jun 2024
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
Lingrui Mei
Shenghua Liu
Yiwei Wang
Baolong Bi
Jiayi Mao
Xueqi Cheng
AAML
209
19
0
17 Jun 2024
Enhancing Question Answering on Charts Through Effective Pre-training Tasks
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackBoxNLP), 2024
Ashim Gupta
Vivek Gupta
Shuo Zhang
Yujie He
Ning Zhang
Shalin S Shah
142
4
0
14 Jun 2024
Analyzing Multi-Head Attention on Trojan BERT Models
Jingwei Wang
181
0
0
12 Jun 2024
CR-UTP: Certified Robustness against Universal Text Perturbations on Large Language Models
Qian Lou
Xin Liang
Jiaqi Xue
Yancheng Zhang
Rui Xie
Mengxin Zheng
AAML
292
0
0
04 Jun 2024
Tool Learning with Large Language Models: A Survey
Changle Qu
Sunhao Dai
Xiaochi Wei
Hengyi Cai
Shuaiqiang Wang
D. Yin
Jun Xu
Jirong Wen
LLMAG
342
214
0
28 May 2024
White-box Multimodal Jailbreaks Against Large Vision-Language Models
Ruofan Wang
Jiabo He
Hanxu Zhou
Chuanjun Ji
Guangnan Ye
Yu-Gang Jiang
AAML
VLM
252
38
0
28 May 2024
Improved Generation of Adversarial Examples Against Safety-aligned LLMs
Qizhang Li
Yiwen Guo
Wangmeng Zuo
Hao Chen
AAML
SILM
240
12
0
28 May 2024
Unveiling the Achilles' Heel of NLG Evaluators: A Unified Adversarial Framework Driven by Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yiming Chen
Chen Zhang
Danqing Luo
L. F. D’Haro
R. Tan
Haizhou Li
AAML
ELM
225
3
0
23 May 2024
ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation
Neural Information Processing Systems (NeurIPS), 2024
Jingnan Zheng
Han Wang
An Zhang
Tai D. Nguyen
Jun Sun
Tat-Seng Chua
LLMAG
357
39
0
23 May 2024
Efficient Universal Goal Hijacking with Semantics-guided Prompt Organization
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yihao Huang
Chong Wang
Yang Liu
Qing Guo
Felix Juefei Xu
Jian Zhang
G. Pu
Yang Liu
328
9
0
23 May 2024
Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
Rheeya Uppaal
Apratim De
Yiting He
Yiqiao Zhong
Junjie Hu
592
7
0
22 May 2024
DEGAP: Dual Event-Guided Adaptive Prefixes for Templated-Based Event Argument Extraction with Slot Querying
Guanghui Wang
Dexi Liu
Jian-Yun Nie
Qizhi Wan
Rong Hu
Xiping Liu
Wanlong Liu
Jiaming Liu
724
3
0
22 May 2024
A Constraint-Enforcing Reward for Adversarial Attacks on Text Classifiers
Tom Roth
Inigo Jauregi Unanue
A. Abuadbba
Massimo Piccardi
AAML
SILM
226
2
0
20 May 2024
Rethinking ChatGPT's Success: Usability and Cognitive Behaviors Enabled by Auto-regressive LLMs' Prompting
Xinzhe Li
Ming Liu
248
1
0
17 May 2024
Red Teaming Language Models for Contradictory Dialogues
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Xiaofei Wen
Bangzheng Li
Tenghao Huang
Muhao Chen
272
0
0
16 May 2024
PLeak: Prompt Leaking Attacks against Large Language Model Applications
Conference on Computer and Communications Security (CCS), 2024
Bo Hui
Haolin Yuan
Neil Zhenqiang Gong
Philippe Burlina
Yinzhi Cao
AAML
LLMAG
SILM
454
113
0
10 May 2024
Logical Negation Augmenting and Debiasing for Prompt-based Methods
Yitian Li
Jidong Tian
Hao He
Yaohui Jin
197
0
0
08 May 2024
Hire Me or Not? Examining Language Model's Behavior with Occupation Attributes
International Conference on Computational Linguistics (COLING), 2024
Damin Zhang
Yi Zhang
Geetanjali Bihani
Julia Taylor Rayz
483
4
0
06 May 2024
Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs
International Conference on Learning Representations (ICLR), 2024
Feiyang Kang
H. Just
Yifan Sun
Himanshu Jahagirdar
Yuanzhi Zhang
Rongxing Du
Anit Kumar Sahu
Ruoxi Jia
191
31
0
05 May 2024
Assessing Adversarial Robustness of Large Language Models: An Empirical Study
Zeyu Yang
Zhao Meng
Xiaochen Zheng
Roger Wattenhofer
ELM
AAML
167
21
0
04 May 2024
Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs
Valeriia Cherepanova
James Zou
AAML
350
9
0
26 Apr 2024
Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
Narek Maloyan
Ekansh Verma
Bulat Nutfullin
Bislan Ashinov
208
17
0
21 Apr 2024
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Anselm Paulus
Arman Zharmagambetov
Chuan Guo
Brandon Amos
Yuandong Tian
AAML
385
123
0
21 Apr 2024
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace
Kai Y. Xiao
R. Leike
Lilian Weng
Johannes Heidecke
Alex Beutel
SILM
349
235
0
19 Apr 2024
SpamDam: Towards Privacy-Preserving and Adversary-Resistant SMS Spam Detection
Yekai Li
Rufan Zhang
Wenxin Rong
Xianghang Mi
208
7
0
15 Apr 2024
Interactive Prompt Debugging with Sequence Salience
Ian Tenney
Ryan Mullins
Bin Du
Shree Pandya
Minsuk Kahng
Lucas Dixon
LRM
180
6
0
11 Apr 2024
Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data
Tim Baumgärtner
Yang Gao
Dana Alon
Donald Metzler
AAML
228
33
0
08 Apr 2024
Goal-guided Generative Prompt Injection Attack on Large Language Models
Kai Wei
Haoyang Ling
Qinkai Yu
Chengzhi Liu
Haochen Xue
Xiaobo Jin
AAML
SILM
293
27
0
06 Apr 2024
PID Control-Based Self-Healing to Improve the Robustness of Large Language Models
Zhuotong Chen
Zihu Wang
Yifan Yang
Qianxiao Li
Zheng Zhang
AAML
245
3
0
31 Mar 2024
Page 4 of 14