arXiv 2307.15043 (Cited By)
Universal and Transferable Adversarial Attacks on Aligned Language Models
27 July 2023
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models" (showing 50 of 938)

Commercial AI, Conflict, and Moral Responsibility: A theoretical analysis and practical approach to the moral responsibilities associated with dual-use AI technology
Daniel Trusilo, David Danks (30 Jan 2024)

Security and Privacy Challenges of Large Language Models: A Survey
B. Das, M. H. Amini, Yanzhao Wu (PILM, ELM; 30 Jan 2024)

Gradient-Based Language Model Red Teaming
Nevan Wichers, Carson E. Denison, Ahmad Beirami (30 Jan 2024)

Tradeoffs Between Alignment and Helpfulness in Language Models with Representation Engineering
Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, Amnon Shashua (LLMSV; 29 Jan 2024)

Red-Teaming for Generative AI: Silver Bullet or Security Theater?
Michael Feffer, Anusha Sinha, Wesley Hanwen Deng, Zachary Chase Lipton, Hoda Heidari (AAML; 29 Jan 2024)

Black-Box Access is Insufficient for Rigorous AI Audits
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell (AAML; 25 Jan 2024)

MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds
Xiaolong Jin, Zhuo Zhang, Xiangyu Zhang (25 Jan 2024)

Fluent dreaming for language models
T. B. Thompson, Zygimantas Straznickas, Michael Sklar (VLM; 24 Jan 2024)

Visibility into AI Agents
Alan Chan, Carson Ezell, Max Kaufmann, K. Wei, Lewis Hammond, ..., Nitarshan Rajkumar, David M. Krueger, Noam Kolt, Lennart Heim, Markus Anderljung (23 Jan 2024)

The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts
Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, Daniel Khashabi (23 Jan 2024)

Raidar: geneRative AI Detection viA Rewriting
Chengzhi Mao, Carl Vondrick, Hao Wang, Junfeng Yang (DeLMO; 23 Jan 2024)

Red Teaming Visual Language Models
Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, Qi Liu (VLM; 23 Jan 2024)

PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, Jing Shao (LLMAG; 22 Jan 2024)

InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, Xipeng Qiu (LLMSV; 20 Jan 2024)

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li (LRM, SILM; 20 Jan 2024)

Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
Adib Hasan, Ileana Rugina, Alex Wang (AAML; 19 Jan 2024)

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, ..., Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, Gongshen Liu (ELM; 18 Jan 2024)

All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
Kazuhiro Takemoto (18 Jan 2024)

Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models
T. Klein, Moin Nabi (16 Jan 2024)

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi (12 Jan 2024)

Intention Analysis Makes LLMs A Good Jailbreak Defender
Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao (LLMSV; 12 Jan 2024)

ML-On-Rails: Safeguarding Machine Learning Models in Software Systems A Case Study
Hala Abdelkader, Mohamed Abdelrazek, Scott Barnett, Jean-Guy Schneider, Priya Rani, Rajesh Vasa (12 Jan 2024)

TOFU: A Task of Fictitious Unlearning for LLMs
Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, J. Zico Kolter (MU, CLL; 11 Jan 2024)

Combating Adversarial Attacks with Multi-Agent Debate
Steffi Chern, Zhen Fan, Andy Liu (AAML; 11 Jan 2024)

Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, ..., Junwu Xiong, Xinyu Kong, Zujie Wen, Ke Xu, Qi Li (11 Jan 2024)

Fighting Fire with Fire: Adversarial Prompting to Generate a Misinformation Detection Dataset
Shrey Satapara, Parth Mehta, Debasis Ganguly, Sandip J Modha (09 Jan 2024)

Understanding LLMs: A Comprehensive Overview from Training to Inference
Yi-Hsueh Liu, Haoyang He, Tianle Han, Xu-Yao Zhang, Mengyuan Liu, ..., Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge (SyDa; 04 Jan 2024)

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea (03 Jan 2024)

A Novel Evaluation Framework for Assessing Resilience Against Prompt Injection Attacks in Large Language Models
Daniel Wankit Yip, Aysan Esmradi, C. Chan (AAML; 02 Jan 2024)

The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness
Neeraj Varshney, Pavel Dolin, Agastya Seth, Chitta Baral (AAML, ELM; 30 Dec 2023)

The Tyranny of Possibilities in the Design of Task-Oriented LLM Systems: A Scoping Survey
Dhruv Dhamani, Mary Lou Maher (29 Dec 2023)

Large Language Models for Conducting Advanced Text Analytics Information Systems Research
Benjamin Ampel, Chi-Heng Yang, J. Hu, Hsinchun Chen (27 Dec 2023)

Alleviating Hallucinations of Large Language Models through Induced Hallucinations
Yue Zhang, Leyang Cui, Wei Bi, Shuming Shi (HILM; 25 Dec 2023)

MetaAID 2.5: A Secure Framework for Developing Metaverse Applications via Large Language Models
Hongyin Zhu (22 Dec 2023)

Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks
Haz Sameen Shahgir, Xianghao Kong, Greg Ver Steeg, Yue Dong (22 Dec 2023)

Training Neural Networks with Internal State, Unconstrained Connectivity, and Discrete Activations
Alexander Grushin (AI4CE; 22 Dec 2023)

Exploiting Novel GPT-4 APIs
Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan McLean, Adam Gleave (SILM; 21 Dec 2023)

Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
Jason Vega, Isha Chaudhary, Changming Xu, Gagandeep Singh (AAML; 19 Dec 2023)

Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models
Jiawei Zhao, Kejiang Chen, Xianjian Yuan, Yuang Qi, Weiming Zhang, Neng H. Yu (15 Dec 2023)

The Earth is Flat because...: Investigating LLMs' Belief towards Misinformation via Persuasive Conversation
Rongwu Xu, Brian S. Lin, Shujian Yang, Tianqi Zhang, Weiyan Shi, Tianwei Zhang, Zhixuan Fang, Wei Xu, Han Qiu (14 Dec 2023)

Exploring Transferability for Randomized Smoothing
Kai Qiu, Huishuai Zhang, Zhirong Wu, Stephen Lin (AAML; 14 Dec 2023)

Causality Analysis for Evaluating the Security of Large Language Models
Wei Zhao, Zhe Li, Junfeng Sun (13 Dec 2023)

Maatphor: Automated Variant Analysis for Prompt Injection Attacks
Ahmed Salem, Andrew J. Paverd, Boris Köpf (12 Dec 2023)

AI Control: Improving Safety Despite Intentional Subversion
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger (12 Dec 2023)

Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
Yu Fu, Yufei Li, Wen Xiao, Cong Liu, Yue Dong (AAML; 12 Dec 2023)

METAL: Metamorphic Testing Framework for Analyzing Large-Language Model Qualities
Sangwon Hyun, Mingyu Guo, Muhammad Ali Babar (11 Dec 2023)

NLLG Quarterly arXiv Report 09/23: What are the most influential current AI Papers?
Ran Zhang, Aida Kostikova, Christoph Leiter, Jonas Belouadi, Daniil Larionov, Yanran Chen, Vivian Fresen, Steffen Eger (09 Dec 2023)

Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
Zhuo Zhang, Guangyu Shen, Guanhong Tao, Shuyang Cheng, Xiangyu Zhang (08 Dec 2023)

Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks
Shuli Jiang, S. Kadhe, Yi Zhou, Ling Cai, Nathalie Baracaldo (SILM, AAML; 07 Dec 2023)

Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
Yanrui Du, Sendong Zhao, Ming Ma, Yuhan Chen, Bing Qin (07 Dec 2023)