
Jailbroken: How Does LLM Safety Training Fail?
arXiv:2307.02483

5 July 2023
Alexander Wei, Nika Haghtalab, Jacob Steinhardt

Papers citing "Jailbroken: How Does LLM Safety Training Fail?"

34 / 634 papers shown

Let the Models Respond: Interpreting Language Model Detoxification Through the Lens of Prompt Dependence
Daniel Scalena, Gabriele Sarti, Malvina Nissim, Elisabetta Fersini
01 Sep 2023

Image Hijacks: Adversarial Images can Control Generative Models at Runtime
Luke Bailey, Euan Ong, Stuart J. Russell, Scott Emmons
VLM, MLLM
01 Sep 2023

Cultural Alignment in Large Language Models: An Explanatory Analysis Based on Hofstede's Cultural Dimensions
Reem I. Masoud, Ziquan Liu, Martin Ferianc, Philip C. Treleaven, Miguel R. D. Rodrigues
25 Aug 2023

Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, Timothy Baldwin
25 Aug 2023

Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities
Maximilian Mozes, Xuanli He, Bennett Kleinberg, Lewis D. Griffin
24 Aug 2023

Sparks of Large Audio Models: A Survey and Outlook
S. Latif, Moazzam Shoukat, Fahad Shamshad, Muhammad Usama, Yi Ren, ..., Wenwu Wang, Xulong Zhang, Roberto Togneri, Erik Cambria, Björn W. Schuller
LM&MA, AuLLM
24 Aug 2023

Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions
Pouya Pezeshkpour, Estevam R. Hruschka
LRM
22 Aug 2023

Using Large Language Models for Cybersecurity Capture-The-Flag Challenges and Certification Questions
W. Tann, Yuancheng Liu, Jun Heng Sim, C. Seah, E. Chang
ELM
21 Aug 2023

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu
SILM
12 Aug 2023
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak
  Prompts on Large Language Models
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
Xinyue Shen
Z. Chen
Michael Backes
Yun Shen
Yang Zhang
SILM
28
241
0
07 Aug 2023

AgentBench: Evaluating LLMs as Agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, ..., Yu-Chuan Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang
ELM, LLMAG
07 Aug 2023

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, Dirk Hovy
ALM, ELM, AILaw
02 Aug 2023

Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection
Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, Hongxia Jin
SILM
31 Jul 2023

Deception Abilities Emerged in Large Language Models
Thilo Hagendorff
LLMAG
31 Jul 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper, Xander Davies, Claudia Shi, T. Gilbert, Jérémy Scheurer, ..., Erdem Biyik, Anca Dragan, David M. Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
ALM, OffRL
27 Jul 2023

Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
27 Jul 2023

Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
Erfan Shayegani, Yue Dong, Nael B. Abu-Ghazaleh
26 Jul 2023

LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?
David Glukhov, Ilia Shumailov, Y. Gal, Nicolas Papernot, V. Papyan
AAML, ELM
20 Jul 2023

Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models
Huachuan Qiu, Shuai Zhang, Anqi Li, Hongliang He, Zhenzhong Lan
ALM
17 Jul 2023

Effective Prompt Extraction from Language Models
Yiming Zhang, Nicholas Carlini, Daphne Ippolito
MIACV, SILM
13 Jul 2023

A Comprehensive Overview of Large Language Models
Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, Ajmal Saeed Mian
OffRL
12 Jul 2023

Visual Adversarial Examples Jailbreak Aligned Large Language Models
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal
AAML
22 Jun 2023

Explore, Establish, Exploit: Red Teaming Language Models from Scratch
Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell
AAML
15 Jun 2023

Science in the Era of ChatGPT, Large Language Models and Generative AI: Challenges for Research Ethics and How to Respond
Evangelos Pournaras
24 May 2023

Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks
Abhinav Rao, S. Vashistha, Atharva Naik, Somak Aditya, Monojit Choudhury
24 May 2023

The Machine Psychology of Cooperation: Can GPT models operationalise prompts for altruism, cooperation, competitiveness and selfishness in economic games?
S. Phelps, Y. Russell
13 May 2023

Analyzing And Editing Inner Mechanisms Of Backdoored Language Models
Max Lamparth, Anka Reuel
KELM
24 Feb 2023

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, C. Endres, Thorsten Holz, Mario Fritz
SILM
23 Feb 2023

On the Impossible Safety of Large AI Models
El-Mahdi El-Mhamdi, Sadegh Farhadkhani, R. Guerraoui, Nirupam Gupta, L. Hoang, Rafael Pinot, Sébastien Rouault, John Stephan
30 Sep 2022

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, John Kernion, Amanda Askell, Yuntao Bai, ..., Nicholas Joseph, Sam McCandlish, C. Olah, Jared Kaplan, Jack Clark
23 Aug 2022

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM
04 Mar 2022

Challenges in Detoxifying Language Models
Johannes Welbl, Amelia Glaese, J. Uesato, Sumanth Dathathri, John F. J. Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, Po-Sen Huang
LM&MA
15 Sep 2021

Extracting Training Data from Large Language Models
Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, ..., Tom B. Brown, D. Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel
MLAU, SILM
14 Dec 2020

Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving
ALM
18 Sep 2019