ResearchTrend.AI
Universal and Transferable Adversarial Attacks on Aligned Language Models (arXiv 2307.15043)

27 July 2023
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson

Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models"

50 / 938 papers shown
Backtracking Improves Generation Safety
Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason Weston, Eric Michael Smith (22 Sep 2024) [SILM]

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach
Zhihao Lin, Wei Ma, Mingyi Zhou, Yanjie Zhao, Haoyu Wang, Yang Liu, Jun Wang, Li Li (21 Sep 2024) [AAML]

Prompt Obfuscation for Large Language Models
David Pape, Thorsten Eisenhofer, Lea Schönherr (17 Sep 2024) [AAML]

Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking
Stav Cohen, Ron Bitton, Ben Nassi (12 Sep 2024)

Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks
Md Zarif Hossain, Ahmed Imteaj (11 Sep 2024) [AAML, VLM]

AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs
Lijia Lv, Weigang Zhang, Xuehai Tang, Jie Wen, Feng Liu, Jizhong Han, Songlin Hu (11 Sep 2024) [AAML]

DiPT: Enhancing LLM reasoning through diversified perspective-taking
H. Just, Mahavir Dabas, Lifu Huang, Ming Jin, Ruoxi Jia (10 Sep 2024) [LRM]

Towards Safe Multilingual Frontier AI
Artūrs Kanepajs, Vladimir Ivanov, Richard Moulange (06 Sep 2024)

An overview of domain-specific foundation model: key technologies, applications and challenges
Haolong Chen, Hanzhi Chen, Zijian Zhao, Kaifeng Han, Guangxu Zhu, Yichen Zhao, Ying Du, Wei Xu, Qingjiang Shi (06 Sep 2024) [ALM, VLM]

Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang (05 Sep 2024) [PILM, AAML]

ContextCite: Attributing Model Generation to Context
Benjamin Cohen-Wang, Harshay Shah, Kristian Georgiev, Aleksander Madry (01 Sep 2024) [LRM]

Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
Bang An, Sicheng Zhu, Ruiyi Zhang, Michael-Andrei Panaitescu-Liess, Yuancheng Xu, Furong Huang (01 Sep 2024) [AAML]

Acceptable Use Policies for Foundation Models
Kevin Klyman (29 Aug 2024)

FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench
Aman Priyanshu, Supriti Vijay (28 Aug 2024) [AAML]

Legilimens: Practical and Unified Content Moderation for Large Language Model Services
Jialin Wu, Jiangyi Deng, Shengyuan Pang, Yanjiao Chen, Jiayang Xu, Xinfeng Li, Wenyuan Xu (28 Aug 2024)

AgentMonitor: A Plug-and-Play Framework for Predictive and Secure Multi-Agent Systems
Chi-Min Chan, Jianxuan Yu, Weize Chen, Chunyang Jiang, Xinyu Liu, Weijie Shi, Zhiyuan Liu, Wei Xue, Yike Guo (27 Aug 2024) [LLMAG]

Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models
Hongfu Liu, Yuxi Xie, Ye Wang, Michael Shieh (27 Aug 2024)

Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
Wenxuan Zhang, Philip H. S. Torr, Mohamed Elhoseiny, Adel Bibi (27 Aug 2024)

BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models
Yige Li, Hanxun Huang, Yunhan Zhao, Xingjun Ma, Jun Sun (23 Aug 2024) [AAML, SILM]

LLM-PBE: Assessing Data Privacy in Large Language Models
Qinbin Li, Junyuan Hong, Chulin Xie, Jeffrey Tan, Rachel Xin, ..., Dan Hendrycks, Zhangyang Wang, Bo Li, Bingsheng He, Dawn Song (23 Aug 2024) [ELM, PILM]

FIDAVL: Fake Image Detection and Attribution using Vision-Language Model
Mamadou Keita, W. Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Abdenour Hadid (22 Aug 2024) [VLM]

Approaching Deep Learning through the Spectral Dynamics of Weights
David Yunis, Kumar Kshitij Patel, Samuel Wheeler, Pedro H. P. Savarese, Gal Vardi, Karen Livescu, Michael Maire, Matthew R. Walter (21 Aug 2024)

Efficient Detection of Toxic Prompts in Large Language Models
Yi Liu, Junzhe Yu, Huijia Sun, Ling Shi, Gelei Deng, Yuqi Chen, Yang Liu (21 Aug 2024)

EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models
Chongwen Zhao, Zhihao Dou, Kaizhu Huang (21 Aug 2024) [AAML]

Learning Randomized Algorithms with Transformers
J. Oswald, Seijin Kobayashi, Yassir Akram, Angelika Steger (20 Aug 2024) [AAML]

Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models
Hongbang Yuan, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao (20 Aug 2024) [AAML, ELM, MU]

Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation
Haoyu Wang, Bingzhe Wu, Yatao Bian, Yongzhe Chang, Xueqian Wang, Peilin Zhao (20 Aug 2024)

Promoting Equality in Large Language Models: Identifying and Mitigating the Implicit Bias based on Bayesian Theory
Yongxin Deng, Xihe Qiu, Xiaoyu Tan, Jing Pan, Chen Jue, Zhijun Fang, Yinghui Xu, Wei Chu, Yuan Qi (20 Aug 2024)

Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Carrier Articles
Zhilong Wang, Haizhou Wang, Nanqing Luo, Lan Zhang, Xiaoyan Sun, Yebo Cao, Peng Liu (20 Aug 2024)

Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks
Kexin Chen, Yi Liu, Dongxia Wang, Jiaying Chen, Wenhai Wang (18 Aug 2024)

BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger
Yulin Chen, Haoran Li, Zihao Zheng, Yangqiu Song, Bryan Hooi (17 Aug 2024)

MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models
Fenghua Weng, Yue Xu, Chengyan Fu, Wenjie Wang (16 Aug 2024) [AAML]

Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks
Jiawei Zhao, Kejiang Chen, Xiaojian Yuan, Weiming Zhang (15 Aug 2024) [AAML]

Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search
Robert J. Moss (11 Aug 2024) [AAML]

A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares
Stav Cohen, Ron Bitton, Ben Nassi (09 Aug 2024) [SILM]

Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles
Xiongtao Sun, Deyue Zhang, Dongdong Yang, Quanchen Zou, Hui Li (08 Aug 2024) [AAML]

WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models
Prannaya Gupta, Le Qi Yau, Hao Han Low, I-Shiang Lee, Hugo Maximus Lim, ..., Jia Hng Koh, Dar Win Liew, Rishabh Bhardwaj, Rajat Bhardwaj, Soujanya Poria (07 Aug 2024) [ELM, LM&MA]

Prompt and Prejudice
Lorenzo Berlincioni, Luca Cultrera, Federico Becattini, Marco Bertini, A. Bimbo (07 Aug 2024)

EnJa: Ensemble Jailbreak on Large Language Models
Jiahao Zhang, Zilong Wang, Ruofan Wang, Xingjun Ma, Yu-Gang Jiang (07 Aug 2024) [AAML]

Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models?
Mohammad Bahrami Karkevandi, Nishant Vishwamitra, Peyman Najafirad (05 Aug 2024) [AAML]

SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models
Muxi Diao, Rumei Li, Shiyang Liu, Guogang Liao, Jingang Wang, Xunliang Cai, Weiran Xu (05 Aug 2024) [AAML]

Strong and weak alignment of large language models with human values
Mehdi Khamassi, Marceau Nahon, Raja Chatila (05 Aug 2024) [ALM]

Operationalizing Contextual Integrity in Privacy-Conscious Assistants
Sahra Ghalebikesabi, Eugene Bagdasaryan, Ren Yi, Itay Yona, Ilia Shumailov, ..., Robert Stanforth, Leonard Berrada, Pushmeet Kohli, Po-Sen Huang, Borja Balle (05 Aug 2024)

Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models
Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, Haoyang Li (05 Aug 2024) [AAML, ELM, SILM]

Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Jingtong Su, Mingyu Lee, SangKeun Lee (02 Aug 2024)

Autonomous LLM-Enhanced Adversarial Attack for Text-to-Motion
Honglei Miao, Fan Ma, Ruijie Quan, Kun Zhan, Yi Yang (01 Aug 2024) [AAML]

Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, ..., Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika (01 Aug 2024) [AAML, MU]

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, ..., Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks (31 Jul 2024) [ELM]

Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification
Boyang Zhang, Yicong Tan, Yun Shen, Ahmed Salem, Michael Backes, Savvas Zannettou, Yang Zhang (30 Jul 2024) [LLMAG, AAML]

Can Editing LLMs Inject Harm?
Canyu Chen, Baixiang Huang, Zekun Li, Zhaorun Chen, Shiyang Lai, ..., Xifeng Yan, William Wang, Philip H. S. Torr, Dawn Song, Kai Shu (29 Jul 2024) [KELM]