Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
arXiv:2312.12321 · 19 December 2023
Jason Vega, Isha Chaudhary, Changming Xu, Gagandeep Singh
AAML

Papers citing "Bypassing the Safety Training of Open-Source LLMs with Priming Attacks"

27 of 27 papers shown.
Matching Ranks Over Probability Yields Truly Deep Safety Alignment
Jason Vega, Gagandeep Singh
AAML · 05 Dec 2025
Lumos: Let there be Language Model System Certification
Isha Chaudhary, Vedaant V. Jain, Avaljot Singh, Kavya Sachdeva, Sayan Ranu, Gagandeep Singh
ELM, LRM · 02 Dec 2025
Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines
Yuhang Wang, Yanxu Zhu, Dongyuan Lu, Jitao Sang
AAML, SILM, ELM, LRM · 26 Nov 2025
Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives
Chloe Li, Mary Phuong, Daniel Tan
10 Nov 2025
Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction
Yuanbo Xie, Yingjie Zhang, Tianyun Liu, Duohe Ma, Tingwen Liu
AAML · 18 Sep 2025
Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models
Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao
AAML · 07 Jul 2025
Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
Ziqi Miao, Yi Ding, Lijun Li, Jing Shao
AAML · 03 Jul 2025
Adversarial Manipulation of Reasoning Models using Internal Representations
Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi
AAML, LRM · 03 Jul 2025
Saffron-1: Safety Inference Scaling
Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong
LRM · 06 Jun 2025
Discovering Forbidden Topics in Language Models
Can Rager, Chris Wendler, Rohit Gandikota, David Bau
23 May 2025
SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment
Wonje Jeung, Sangyeon Yoon, Minsuk Kahng, Albert No
LRM, LLMSV · 20 May 2025
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning
Zhehao Zhang, Weijie Xu, Fanyou Wu, Chandan K. Reddy
12 May 2025
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models
Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin
12 May 2025
Representation Bending for Large Language Model Safety
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Ashkan Yousefpour, Taeheon Kim, Ryan S. Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Alvin Wan, Harrison Ngan, Youngjae Yu, Jonghyun Choi
AAML, ALM, KELM · 02 Apr 2025
SafeArena: Evaluating the Safety of Autonomous Web Agents
Ada Defne Tur, Nicholas Meade, Xing Han Lù, Alejandra Zambrano, Arkil Patel, Esin Durmus, Spandana Gella, Karolina Stańczak, Siva Reddy
LLMAG, ELM · 06 Mar 2025
One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs
Jacob Dunefsky, Arman Cohan
LLMSV · 26 Feb 2025
A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens
Sophie Xhonneux, David Dobre, Mehrnaz Mofakhami, Leo Schwinn, Gauthier Gidel
22 Feb 2025
Fast Proxies for LLM Robustness Evaluation
Tim Beyer, Jan Schuchardt, Leo Schwinn, Stephan Günnemann
AAML · 14 Feb 2025
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
AAAI Conference on Artificial Intelligence (AAAI), 2025
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
SILM · 08 Jan 2025
Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment
Jason Vega, Junsheng Huang, Gaokai Zhang, Hangoo Kang, Minjia Zhang, Gagandeep Singh
05 Nov 2024
HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router
Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Ruibin Yuan, Xueqi Cheng
03 Oct 2024
Backtracking Improves Generation Safety
Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason Weston, Eric Michael Smith
SILM · 22 Sep 2024
Safety Alignment Should Be Made More Than Just a Few Tokens Deep
International Conference on Learning Representations (ICLR), 2025
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson
10 Jun 2024
Improving Alignment and Robustness with Circuit Breakers
Neural Information Processing Systems (NeurIPS), 2024
Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks
AAML · 06 Jun 2024
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
International Conference on Learning Representations (ICLR), 2025
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
AAML · 02 Apr 2024
Certifying Knowledge Comprehension in LLMs
Isha Chaudhary, Vedaant V. Jain, Gagandeep Singh
24 Feb 2024
Building Guardrails for Large Language Models
Yi Dong, Ronghui Mu, Gaojie Jin, Yi Qi, Jinwei Hu, Xingyu Zhao, Jie Meng, Wenjie Ruan, Xiaowei Huang
OffRL · 02 Feb 2024