Baseline Defenses for Adversarial Attacks Against Aligned Language Models (arXiv:2309.00614)
Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping Yeh-Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, Tom Goldstein
1 September 2023 (AAML)

Papers citing "Baseline Defenses for Adversarial Attacks Against Aligned Language Models"

Showing 50 of 268 citing papers.

One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models
Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin
12 May 2025

POISONCRAFT: Practical Poisoning of Retrieval-Augmented Generation for Large Language Models
Yangguang Shao, Xinjie Lin, Haozheng Luo, Chengshang Hou, G. Xiong, J. Yu, Junzheng Shi
10 May 2025 (SILM)

System Prompt Poisoning: Persistent Attacks on Large Language Models Beyond User Injection
Jiawei Guo, Haipeng Cai
10 May 2025 (SILM, AAML)

AgentXploit: End-to-End Redteaming of Black-Box AI Agents
Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, Dawn Song
09 May 2025 (LLMAG, AAML)

LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities
Kalyan Nakka, Jimmy Dani, Ausmit Mondal, Nitesh Saxena
08 May 2025 (AAML)

Transferable Adversarial Attacks on Black-Box Vision-Language Models
Kai Hu, Weichen Yu, L. Zhang, Alexander Robey, Andy Zou, Chengming Xu, Haoqi Hu, Matt Fredrikson
02 May 2025 (AAML, VLM)

Attack and defense techniques in large language models: A survey and new perspectives
Zhiyu Liao, Kang Chen, Yuanguo Lin, Kangkang Li, Yunxuan Liu, Hefeng Chen, Xingwang Huang, Yuanhui Yu
02 May 2025 (AAML)

LLM Security: Vulnerabilities, Attacks, Defenses, and Countermeasures
Francisco Aguilera-Martínez, Fernando Berzal
02 May 2025 (PILM)

Traceback of Poisoning Attacks to Retrieval-Augmented Generation
Baolei Zhang, Haoran Xin, Minghong Fang, Zhuqing Liu, Biao Yi, Tong Li, Zheli Liu
30 Apr 2025 (SILM, AAML)

Hoist with His Own Petard: Inducing Guardrails to Facilitate Denial-of-Service Attacks on Retrieval-Augmented Generation of LLMs
Pan Suo, Yu-ming Shang, San-Chuan Guo, Xi Zhang
30 Apr 2025 (SILM, AAML)

Prompt Injection Attack to Tool Selection in LLM Agents
Jiawen Shi, Zenghui Yuan, Guiyao Tie, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun
28 Apr 2025 (LLMAG)

RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models
Bang An, Shiyue Zhang, Mark Dredze
25 Apr 2025

Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections
Narek Maloyan, Dmitry Namiot
25 Apr 2025 (SILM, AAML, ELM)

Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models
Tri Nguyen, Lohith Srikanth Pentapalli, Magnus Sieverding, Laurah Turner, Seth Overla, ..., Michael Gharib, Matt Kelleher, Michael Shukis, Cameron Pawlik, Kelly Cohen
21 Apr 2025

DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks
Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, Neil Zhenqiang Gong
15 Apr 2025 (AAML)

Exploring Backdoor Attack and Defense for LLM-empowered Recommendations
Liangbo Ning, Wenqi Fan, Qing Li
15 Apr 2025 (AAML, SILM)

Feature-Aware Malicious Output Detection and Mitigation
Weilong Dong, Peiguang Li, Yu Tian, Xinyi Zeng, Fengdi Li, Sirui Wang
12 Apr 2025 (AAML)

Practical Poisoning Attacks against Retrieval-Augmented Generation
Baolei Zhang, Y. Chen, Minghong Fang, Zhuqing Liu, Lihai Nie, Tong Li, Zheli Liu
04 Apr 2025 (SILM, AAML)

Retrieval-Augmented Purifier for Robust LLM-Empowered Recommendation
Liangbo Ning, Wenqi Fan, Qing Li
03 Apr 2025 (AAML)

LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution
Zhuoran Yang, Jie Peng, Zhen Tan, Tianlong Chen, Yanyong Zhang
02 Apr 2025 (AAML)

Evolving Security in LLMs: A Study of Jailbreak Attacks and Defenses
Zhengchun Shang, Wenlan Wei
02 Apr 2025 (AAML)

Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics
Shide Zhou, K. Wang, Ling Shi, H. Wang
01 Apr 2025

The Illusionist's Prompt: Exposing the Factual Vulnerabilities of Large Language Models with Linguistic Nuances
Yining Wang, Y. Wang, Xi Li, Mi Zhang, Geng Hong, Min Yang
01 Apr 2025 (AAML, HILM)

Encrypted Prompt: Securing LLM Applications Against Unauthorized Actions
Shih-Han Chan
29 Mar 2025 (AAML)

STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models
Xunguang Wang, Wenxuan Wang, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Daoyuan Wu, Shuai Wang
23 Mar 2025

Augmented Adversarial Trigger Learning
Zhe Wang, Yanjun Qi
16 Mar 2025

In-Context Defense in Computer Agents: An Empirical Study
Pei Yang, Hai Ci, Mike Zheng Shou
12 Mar 2025 (AAML, LLMAG)

Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation
Wenlong Meng, Fan Zhang, Wendao Yao, Zhenyuan Guo, Y. Li, Chengkun Wei, Wenzhi Chen
11 Mar 2025 (AAML)

CtrlRAG: Black-box Adversarial Attacks Based on Masked Language Models in Retrieval-Augmented Language Generation
Runqi Sui
10 Mar 2025 (AAML)

Safety Guardrails for LLM-Enabled Robots
Zachary Ravichandran, Alexander Robey, Vijay R. Kumar, George Pappas, Hamed Hassani
10 Mar 2025

Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation
Wenhui Zhang, Huiyu Xu, Zhibo Wang, Zeqing He, Ziqi Zhu, Kui Ren
09 Mar 2025 (AAML, PILM)

Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger, Boussad Addad, Katarzyna Kapusta
08 Mar 2025 (AAML)

Poisoned-MRAG: Knowledge Poisoning Attacks to Multimodal Retrieval Augmented Generation
Yinuo Liu, Zenghui Yuan, Guiyao Tie, Jiawen Shi, Lichao Sun, Neil Zhenqiang Gong
08 Mar 2025

Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks
Liming Lu, Shuchao Pang, Siyuan Liang, Haotian Zhu, Xiyu Zeng, Aishan Liu, Yunhuai Liu, Yongbin Zhou
05 Mar 2025 (AAML)

Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models
Alberto Purpura, Sahil Wadhwa, Jesse Zymet, Akshay Gupta, Andy Luo, Melissa Kazemi Rad, Swapnil Shinde, Mohammad Sorower
03 Mar 2025 (AAML)

Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs
Shiyu Xiang, Ansen Zhang, Yanfei Cao, Yang Fan, Ronghao Chen
26 Feb 2025 (AAML)

Shh, don't say that! Domain Certification in LLMs
Cornelius Emde, Alasdair Paren, Preetham Arvind, Maxime Kayser, Tom Rainforth, Thomas Lukasiewicz, Bernard Ghanem, Philip H. S. Torr, Adel Bibi
26 Feb 2025

AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
Zhexin Zhang, Leqi Lei, Junxiao Yang, Xijie Huang, Yida Lu, ..., Xianqi Lei, C. Pan, Lei Sha, H. Wang, Minlie Huang
24 Feb 2025 (AAML)

SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention
Jiaqi Wu, Chen Chen, Chunyan Hou, Xiaojie Yuan
24 Feb 2025 (AAML)

Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs
Giulio Zizzo, Giandomenico Cornacchia, Kieran Fraser, Muhammad Zaid Hameed, Ambrish Rawat, Beat Buesser, Mark Purcell, Pin-Yu Chen, P. Sattigeri, Kush R. Varshney
24 Feb 2025 (AAML)

Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming
Rui Li, Peiyi Wang, Jingyuan Ma, Di Zhang, Lei Sha, Zhifang Sui
22 Feb 2025 (LLMAG)

Making Them a Malicious Database: Exploiting Query Code to Jailbreak Aligned Large Language Models
Qingsong Zou, Jingyu Xiao, Qing Li, Zhi Yan, Y. Wang, Li Xu, Wenxuan Wang, Kuofeng Gao, Ruoyu Li, Yong-jia Jiang
21 Feb 2025 (AAML)

Scaling Trends in Language Model Robustness
Nikolhaus Howe, Michal Zajac, I. R. McKenzie, Oskar Hollinsworth, Tom Tseng, Aaron David Tucker, Pierre-Luc Bacon, Adam Gleave
21 Feb 2025

UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models
Huawei Lin, Yingjie Lao, Tong Geng, Tan Yu, Weijie Zhao
18 Feb 2025 (AAML, SILM)

Computational Safety for Generative AI: A Signal Processing Perspective
Pin-Yu Chen
18 Feb 2025

DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing
Yi Wang, Fenghua Weng, S. Yang, Zhan Qin, Minlie Huang, Wenjie Wang
17 Feb 2025 (KELM, AAML)

OverThink: Slowdown Attacks on Reasoning LLMs
A. Kumar, Jaechul Roh, A. Naseh, Marzena Karpinska, Mohit Iyyer, Amir Houmansadr, Eugene Bagdasarian
04 Feb 2025 (LRM)

Topic-FlipRAG: Topic-Orientated Adversarial Opinion Manipulation Attacks to Retrieval-Augmented Generation Models
Y. Gong, Zhuo Chen, Miaokun Chen, Fengchang Yu, Wei-Tsung Lu, XiaoFeng Wang, Xiaozhong Liu, J. Liu
03 Feb 2025 (AAML, SILM)

Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation
A. Naseh, Yuefeng Peng, Anshuman Suri, Harsh Chaudhari, Alina Oprea, Amir Houmansadr
01 Feb 2025 (SILM, AAML, RALM)

Trading Inference-Time Compute for Adversarial Robustness
Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, ..., Rachel Dias, Eric Wallace, Kai Y. Xiao, Johannes Heidecke, Amelia Glaese
31 Jan 2025 (LRM, AAML)