
Universal and Transferable Adversarial Attacks on Aligned Language Models (arXiv:2307.15043)

27 July 2023
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson

Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models"

50 of 938 citing papers shown, newest first.

Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning
Alex Beutel, Kai Y. Xiao, Johannes Heidecke, Lilian Weng · AAML · 24 Dec 2024

DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak
Hao Wang, Hao Li, Junda Zhu, Xinyuan Wang, C. Pan, Minlie Huang, Lei Sha · 23 Dec 2024

Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs
Alexander von Recum, Christoph Schnabl, Gabor Hollbeck, Silas Alberti, Philip Blinde, Marvin von Hagen · 22 Dec 2024

The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents
Feiran Jia, Tong Wu, Xin Qin, Anna Squicciarini · LLMAG, AAML · 21 Dec 2024

SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage
Xiaoning Dong, Wenbo Hu, Wei Xu, Tianxing He · 19 Dec 2024

Mitigating Adversarial Attacks in LLMs through Defensive Suffix Generation
Minkyoung Kim, Yunha Kim, Hyeram Seo, Heejung Choi, Jiye Han, ..., Hyoje Jung, Byeolhee Kim, Young-Hak Kim, Sanghyun Park, Tae Joon Jun · AAML · 18 Dec 2024

Adversarial Hubness in Multi-Modal Retrieval
Tingwei Zhang, Fnu Suya, Rishi Jha, Collin Zhang, Vitaly Shmatikov · AAML · 18 Dec 2024

Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing
Keltin Grimes, Marco Christiani, David Shriver, Marissa Connor · KELM · 17 Dec 2024

Jailbreaking? One Step Is Enough!
Weixiong Zheng, Peijian Zeng, Y. Li, Hongyan Wu, Nankai Lin, J. Chen, Aimin Yang, Y. Zhou · AAML · 17 Dec 2024

LLMs Can Simulate Standardized Patients via Agent Coevolution
Zhuoyun Du, Lujie Zheng, Renjun Hu, Yuyang Xu, X. Li, Ying Sun, Wei Chen, Jian Wu, Haolei Cai, Haohao Ying · LM&MA · 16 Dec 2024

The Superalignment of Superhuman Intelligence with Large Language Models
Minlie Huang, Yingkang Wang, Shiyao Cui, Pei Ke, J. Tang · 15 Dec 2024

No Free Lunch for Defending Against Prefilling Attack by In-Context Learning
Zhiyu Xue, Guangliang Liu, Bocheng Chen, K. Johnson, Ramtin Pedarsani · AAML · 13 Dec 2024

FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks
Bocheng Chen, Hanqing Guo, Qiben Yan · AAML · 10 Dec 2024

Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation
Xuying Li, Zhuo Li, Yuji Kosuga, Yasuhiro Yoshida, Victor Bian · AAML · 05 Dec 2024

Time-Reversal Provides Unsupervised Feedback to LLMs
Yerram Varun, Rahul Madhavan, Sravanti Addepalli, A. Suggala, Karthikeyan Shanmugam, Prateek Jain · LRM, SyDa · 03 Dec 2024

Improved Large Language Model Jailbreak Detection via Pretrained Embeddings
Erick Galinkin, Martin Sablotny · 02 Dec 2024

Yi-Lightning Technical Report
01.AI: Alan Wake, Albert Wang, Bei Chen, ..., Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, Zonghong Dai · OSLM · 02 Dec 2024

Quantized Delta Weight Is Safety Keeper
Yule Liu, Zhen Sun, Xinlei He, Xinyi Huang · 29 Nov 2024

On the Adversarial Robustness of Instruction-Tuned Large Language Models for Code
Md. Imran Hossen, X. Hei · AAML, ELM · 29 Nov 2024

PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning
Shenghui Li, Edith C. H. Ngai, Fanghua Ye, Thiemo Voigt · SILM · 28 Nov 2024

Politicians vs ChatGPT. A study of presuppositions in French and Italian political communication
Davide Garassino, Viviana Masia, Nicola Brocca, Alice Delorme Benites · 27 Nov 2024

Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models
Shuyang Hao, Bryan Hooi, J. Liu, Kai-Wei Chang, Zi Huang, Yujun Cai · AAML · 27 Nov 2024

In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models
Zhi-Yi Chin, Kuan-Chen Mu, Mario Fritz, Pin-Yu Chen · DiffM · 25 Nov 2024

RAG-Thief: Scalable Extraction of Private Data from Retrieval-Augmented Generation Applications with Agent-based Attacks
Changyue Jiang, Xudong Pan, Geng Hong, Chenfu Bao, Min Yang · SILM · 21 Nov 2024

Global Challenge for Safe and Secure LLMs Track 1
Xiaojun Jia, Yihao Huang, Yang Liu, Peng Yan Tan, Weng Kuan Yau, ..., Yan Wang, Rick Siow Mong Goh, Liangli Zhen, Yingjie Zhang, Zhe Zhao · ELM, AILaw · 21 Nov 2024

Rethinking the Intermediate Features in Adversarial Attacks: Misleading Robotic Models via Adversarial Distillation
Ke Zhao, Huayang Huang, Miao Li, Yu Wu · AAML · 21 Nov 2024

CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization
Nay Myat Min, Long H. Pham, Yige Li, Jun Sun · AAML · 18 Nov 2024

JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, Chun Chen · 17 Nov 2024

Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations
Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Michael Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, Mahesh Pasupuleti · MLLM, 3DH · 15 Nov 2024

Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey
Xuannan Liu, Xing Cui, Peipei Li, Zekun Li, Huaibo Huang, Shuhan Xia, Miaoxuan Zhang, Yueying Zou, Ran He · AAML · 14 Nov 2024

DROJ: A Prompt-Driven Attack against Large Language Models
Leyang Hu, Boran Wang · 14 Nov 2024

New Emerged Security and Privacy of Pre-trained Model: a Survey and Outlook
Meng Yang, Tianqing Zhu, Chi Liu, Wanlei Zhou, Shui Yu, Philip S. Yu · AAML, ELM, PILM · 12 Nov 2024

HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment
Yannis Belkhiter, Giulio Zizzo, S. Maffeis · 11 Nov 2024

A Survey of AI-Related Cyber Security Risks and Countermeasures in Mobility-as-a-Service
Kai-Fung Chu, Haiyue Yuan, Jinsheng Yuan, Weisi Guo, Nazmiye Balta-Ozkan, Shujun Li · 08 Nov 2024

Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models
Saketh Bachu, Erfan Shayegani, Trishna Chakraborty, Rohit Lal, Arindam Dutta, Chengyu Song, Yue Dong, Nael B. Abu-Ghazaleh, A. Roy-Chowdhury · 06 Nov 2024

Diversity Helps Jailbreak Large Language Models
Weiliang Zhao, Daniel Ben-Levi, Wei Hao, Junfeng Yang, Chengzhi Mao · AAML · 06 Nov 2024

Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
Jason Vega, Junsheng Huang, Gaokai Zhang, Hangoo Kang, Minjia Zhang, Gagandeep Singh · 05 Nov 2024

Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios
Yunkai Dang, Mengxi Gao, Yibo Yan, Xin Zou, Yanggan Gu, Aiwei Liu, Xuming Hu · 05 Nov 2024

Attacking Vision-Language Computer Agents via Pop-ups
Yanzhe Zhang, Tao Yu, Diyi Yang · AAML, VLM · 04 Nov 2024

Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control
Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye · 04 Nov 2024

UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models
Sejoon Oh, Yiqiao Jin, Megha Sharma, Donghyun Kim, Eric Ma, Gaurav Verma, Srijan Kumar · 03 Nov 2024

Achieving Domain-Independent Certified Robustness via Knowledge Continuity
Alan Sun, Chiyu Ma, Kenneth Ge, Soroush Vosoughi · 03 Nov 2024

SQL Injection Jailbreak: A Structural Disaster of Large Language Models
Jiawei Zhao, Kejiang Chen, W. Zhang, Nenghai Yu · AAML · 03 Nov 2024

What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Nathalie Maria Kirch, Constantin Weisser, Severin Field, Helen Yannakoudakis, Stephen Casper · 02 Nov 2024

Plentiful Jailbreaks with String Compositions
Brian R. Y. Huang · AAML · 01 Nov 2024

Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection
Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson · AAML · 01 Nov 2024

Defense Against Prompt Injection Attack by Leveraging Attack Techniques
Yulin Chen, Haoran Li, Zihao Zheng, Y. Song, Dekai Wu, Bryan Hooi · SILM, AAML · 01 Nov 2024

Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs
Muhammed Saeed, Elgizouli Mohamed, Mukhtar Mohamed, Shaina Raza, Muhammad Abdul-Mageed, Shady Shehata · 31 Oct 2024

Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models
Yiqi Yang, Hongye Fu · AAML · 31 Oct 2024

Transformation-Invariant Learning and Theoretical Guarantees for OOD Generalization
Omar Montasser, Han Shao, Emmanuel Abbe · OOD · 30 Oct 2024