"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak
  Prompts on Large Language Models

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

7 August 2023
Xinyue Shen, Z. Chen, Michael Backes, Yun Shen, Yang Zhang
SILM

Papers citing ""Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models"

50 / 178 papers shown
Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs
Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Z. Liu, Lirong Qiu, Xi Zhang
AAML · 18 Oct 2024
SPIN: Self-Supervised Prompt INjection
Leon Zhou, Junfeng Yang, Chengzhi Mao
AAML, SILM · 17 Oct 2024
On the Role of Attention Heads in Large Language Model Safety
Z. Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, Yongbin Li
17 Oct 2024
SoK: Prompt Hacking of Large Language Models
Baha Rababah, Shang Wu, Matthew Kwiatkowski, Carson Leung, Cuneyt Gurcan Akcora
AAML · 16 Oct 2024
AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment
Pankayaraj Pathmanathan, Udari Madhushani Sehwag, Michael-Andrei Panaitescu-Liess, Furong Huang
SILM, AAML · 15 Oct 2024
Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting
Yifan Luo, Zhennan Zhou, Meitan Wang, Bin Dong
14 Oct 2024
DAWN: Designing Distributed Agents in a Worldwide Network
Zahra Aminiranjbar, Jianan Tang, Qiudan Wang, Shubha Pant, Mahesh Viswanathan
LLMAG, AI4CE · 11 Oct 2024
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models
Qin Liu, Chao Shang, Ling Liu, Nikolaos Pappas, Jie Ma, Neha Anna John, Srikanth Doss Kadarundalagi Raghuram Doss, Lluís Marquez, Miguel Ballesteros, Yassine Benajiba
11 Oct 2024
JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework
Fan Liu, Yue Feng, Zhao Xu, Lixin Su, Xinyu Ma, Dawei Yin, Hao Liu
ELM · 11 Oct 2024
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
09 Oct 2024
Superficial Safety Alignment Hypothesis
Jianwei Li, Jung-Eun Kim
07 Oct 2024
Permissive Information-Flow Analysis for Large Language Models
Shoaib Ahmed Siddiqui, Radhika Gaonkar, Boris Köpf, David M. Krueger, Andrew J. Paverd, Ahmed Salem, Shruti Tople, Lukas Wutschitz, Menglin Xia, Santiago Zanella Béguelin
04 Oct 2024
FlipAttack: Jailbreak LLMs via Flipping
Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, Bryan Hooi
AAML · 02 Oct 2024
Robust LLM safeguarding via refusal feature adversarial training
L. Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda
AAML · 30 Sep 2024
Enhancing Guardrails for Safe and Secure Healthcare AI
Ananya Gangavarapu
25 Sep 2024
Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking
Stav Cohen, Ron Bitton, Ben Nassi
12 Sep 2024
Differentially Private Kernel Density Estimation
Erzhi Liu, Jerry Yao-Chieh Hu, Alex Reneau, Zhao Song, Han Liu
03 Sep 2024
Acceptable Use Policies for Foundation Models
Kevin Klyman
29 Aug 2024
FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench
Aman Priyanshu, Supriti Vijay
AAML · 28 Aug 2024
Is Generative AI the Next Tactical Cyber Weapon For Threat Actors? Unforeseen Implications of AI Generated Cyber Attacks
Yusuf Usman, Aadesh Upadhyay, P. Gyawali, Robin Chataut
AAML · 23 Aug 2024
Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Jingtong Su, Mingyu Lee, SangKeun Lee
02 Aug 2024
Risks, Causes, and Mitigations of Widespread Deployments of Large Language Models (LLMs): A Survey
Md. Nazmus Sakib, Md Athikul Islam, Royal Pathak, Md Mashrur Arifin
ALM, PILM · 01 Aug 2024
Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification
Boyang Zhang, Yicong Tan, Yun Shen, Ahmed Salem, Michael Backes, Savvas Zannettou, Yang Zhang
LLMAG, AAML · 30 Jul 2024
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
Huiyu Xu, Wenhui Zhang, Zhibo Wang, Feng Xiao, Rui Zheng, Yunhe Feng, Zhongjie Ba, Kui Ren
AAML, LLMAG · 23 Jul 2024
PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
Blazej Manczak, Eliott Zemour, Eric Lin, Vaikkunth Mugunthan
23 Jul 2024
XAI meets LLMs: A Survey of the Relation between Explainable AI and Large Language Models
Erik Cambria, Lorenzo Malandri, Fabio Mercorio, Navid Nobani, Andrea Seveso
21 Jul 2024
BadRobot: Jailbreaking Embodied LLMs in the Physical World
Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Yichen Wang, ..., Shengshan Hu, Leo Yu Zhang, Aishan Liu, Peijin Guo, Leo Yu Zhang
LM&Ro · 16 Jul 2024
Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation
Riccardo Cantini, Giada Cosenza, A. Orsino, Domenico Talia
AAML · 11 Jul 2024
Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Jiayang Song, Yuheng Huang, Zhehua Zhou, Lei Ma
10 Jul 2024
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li
AAML · 05 Jul 2024
SOS! Soft Prompt Attack Against Open-Source Large Language Models
Ziqing Yang, Michael Backes, Yang Zhang, Ahmed Salem
AAML · 03 Jul 2024
JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets
Zhihua Jin, Shiyi Liu, Haotian Li, Xun Zhao, Huamin Qu
03 Jul 2024
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang
PILM · 26 Jun 2024
Generative AI Misuse: A Taxonomy of Tactics and Insights from Real-World Data
Nahema Marchal, Rachel Xu, Rasmi Elasmar, Iason Gabriel, Beth Goldberg, William S. Isaac
LLMAG · 19 Jun 2024
CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models
Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Meijuan An, Bikun Yang, Kaikai Zhao, Kai Wang, Shiguo Lian
ELM · 14 Jun 2024
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, Anyu Wang
ELM · 13 Jun 2024
Merging Improves Self-Critique Against Jailbreak Attacks
Victor Gallego
AAML, MoMe · 11 Jun 2024
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Shuai Wang, Yingjiu Li, Yang Liu, Ning Liu, Juergen Rahmel
AAML · 08 Jun 2024
Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
Fan Liu, Zhao Xu, Hao Liu
AAML · 07 Jun 2024
Measure-Observe-Remeasure: An Interactive Paradigm for Differentially-Private Exploratory Analysis
Priyanka Nanayakkara, Hyeok Kim, Yifan Wu, Ali Sarvghad, Narges Mahyar, G. Miklau, Jessica Hullman
04 Jun 2024
AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways
Zehang Deng, Yongjian Guo, Changzhou Han, Wanlun Ma, Junwu Xiong, Sheng Wen, Yang Xiang
04 Jun 2024
Safeguarding Large Language Models: A Survey
Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, ..., Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, Xiaowei Huang
OffRL, KELM, AILaw · 03 Jun 2024
BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
Diego Dorn, Alexandre Variengien, Charbel-Raphaël Ségerie, Vincent Corruble
03 Jun 2024
Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens
Jiahao Yu, Haozheng Luo, Jerry Yao-Chieh Hu, Wenbo Guo, Han Liu, Xinyu Xing
31 May 2024
Voice Jailbreak Attacks Against GPT-4o
Xinyue Shen, Yixin Wu, Michael Backes, Yang Zhang
AuLLM · 29 May 2024
A Theoretical Understanding of Self-Correction through In-context Alignment
Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, Yisen Wang
LRM · 28 May 2024
Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models
Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, Chun-ying Huang
27 May 2024
Lessons from the Trenches on Reproducible Evaluation of Language Models
Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, J. Tow, ..., Xiangru Tang, Kevin A. Wang, Genta Indra Winata, François Yvon, Andy Zou
ELM, ALM · 23 May 2024
MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability
Yanrui Du, Sendong Zhao, Danyang Zhao, Ming Ma, Yuhan Chen, Liangyu Huo, Qing Yang, Dongliang Xu, Bing Qin
23 May 2024
Semantic-guided Prompt Organization for Universal Goal Hijacking against LLMs
Yihao Huang, Chong Wang, Xiaojun Jia, Qing-Wu Guo, Felix Juefei Xu, Jian Zhang, G. Pu, Yang Liu
23 May 2024