arXiv: 2412.15289
SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
19 December 2024 (v5, latest)
Xiaoning Dong, Wenbo Hu, Wei Xu, Tianxing He

Papers citing "SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage" (32 papers)
Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning
  Zhaoqi Wang, D. He, Zijian Zhang, Xin Li, Liehuang Zhu, Meng Li, Jiamou Liu (28 Sep 2025)

AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs
  Debdeep Sanyal, Manodeep Ray, Murari Mandal (06 Sep 2025)

Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models
  Guangyu Yang, Jinghong Chen, Jingbiao Mei, Weizhe Lin, Bill Byrne (22 Aug 2025)

PUZZLED: Jailbreaking LLMs through Word-Based Puzzles
  Yelim Ahn, Jaejin Lee (02 Aug 2025)

Qwen2 Technical Report
  An Yang, Baosong Yang, Binyuan Hui, Jian Xu, Bowen Yu, ..., Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, Zhi-Wei Fan (15 Jul 2024)

Refusal in Language Models Is Mediated by a Single Direction
  Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda (17 Jun 2024)

Improving Alignment and Robustness with Circuit Breakers (NeurIPS 2024)
  Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks (06 Jun 2024)

AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting (ECCV 2024)
  Yu Wang, Xiaogeng Liu, Yu-Feng Li, Muhao Chen, Chaowei Xiao (14 Mar 2024)

Defending LLMs against Jailbreaking Attacks via Backtranslation
  Yihan Wang, Zhouxing Shi, Andrew Bai, Cho-Jui Hsieh (26 Feb 2024)

GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis
  Yueqi Xie, Minghong Fang, Renjie Pi, Neil Zhenqiang Gong (21 Feb 2024)

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
  Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran (19 Feb 2024)

Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues
  Zhiyuan Chang, Mingyang Li, Yi Liu, Peng Li, Qing Wang, Yang Liu (14 Feb 2024)

Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
  Andy Zhou, Bo Li, Haohan Wang (30 Jan 2024)

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (ACL 2024)
  Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi (12 Jan 2024)

Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking
  Nan Xu, Fei Wang, Ben Zhou, Bangzheng Li, Chaowei Xiao, Muhao Chen (16 Nov 2023)

Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization (ACL 2023)
  Zhexin Zhang, Junxiao Yang, Pei Ke, Fei Mi, Hongning Wang, Shiyu Huang (15 Nov 2023)

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily (NAACL 2023)
  Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, Shujian Huang (14 Nov 2023)

DeepInception: Hypnotize Large Language Model to Be Jailbreaker
  Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han (06 Nov 2023)

Jailbreaking Black Box Large Language Models in Twenty Queries
  Patrick Chao, Avi Schwarzschild, Guang Cheng, Hamed Hassani, George J. Pappas, Eric Wong (12 Oct 2023)

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
  Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing (19 Sep 2023)

Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM (ACL 2023)
  Bochuan Cao, Yu Cao, Lu Lin, Jinghui Chen (18 Sep 2023)

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning (ICLR 2023)
  Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu-Chuan Su, Wenhu Chen (11 Sep 2023)

Baseline Defenses for Adversarial Attacks Against Aligned Language Models
  Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping Yeh-Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, Tom Goldstein (01 Sep 2023)

Universal and Transferable Adversarial Attacks on Aligned Language Models
  Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson (27 Jul 2023)

Llama 2: Open Foundation and Fine-Tuned Chat Models
  Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, ..., Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom (18 Jul 2023)

Direct Preference Optimization: Your Language Model is Secretly a Reward Model (NeurIPS 2023)
  Rafael Rafailov, Archit Sharma, E. Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn (29 May 2023)

Co-Writing Screenplays and Theatre Scripts with Language Models: An Evaluation by Industry Professionals (CHI 2022)
  Piotr Wojciech Mirowski, Kory W. Mathewson, Jaylen Pittman, Richard Evans (29 Sep 2022)

Training language models to follow instructions with human feedback (NeurIPS 2022)
  Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe (04 Mar 2022)

Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models (NeurIPS 2022)
  Wei Ping, Ming-Yu Liu, Chaowei Xiao, Peng Xu, M. Patwary, Mohammad Shoeybi, Yue Liu, Anima Anandkumar, Bryan Catanzaro (08 Feb 2022)

Language Models are Few-Shot Learners (NeurIPS 2020)
  Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei (28 May 2020)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (11 Oct 2018)

Deep reinforcement learning from human preferences (NeurIPS 2017)
  Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei (12 Jun 2017)