Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space

14 February 2024 · AAML · arXiv:2402.09063
Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, Stephan Günnemann

Papers citing "Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space"

32 papers shown
Representation Bending for Large Language Model Safety
Ashkan Yousefpour, Taeheon Kim, Ryan S. Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Alvin Wan, Harrison Ngan, Youngjae Yu, Jonghyun Choi
02 Apr 2025 · AAML, ALM, KELM

LLM-Safety Evaluations Lack Robustness
Tim Beyer, Sophie Xhonneux, Simon Geisler, Gauthier Gidel, Leo Schwinn, Stephan Günnemann
04 Mar 2025 · ALM, ELM

REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective
Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Vincent Cohen-Addad, Johannes Gasteiger, Stephan Günnemann
24 Feb 2025 · AAML

The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, Johannes Gasteiger
24 Feb 2025 · LLMSV

A generative approach to LLM harmfulness detection with special red flag tokens
Sophie Xhonneux, David Dobre, Mehrnaz Mofakhami, Leo Schwinn, Gauthier Gidel
22 Feb 2025

Robustness and Cybersecurity in the EU Artificial Intelligence Act
Henrik Nolte, Miriam Rateike, Michèle Finck
22 Feb 2025

Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models
Haokun Chen, Sebastian Szyller, Weilin Xu, N. Himayat
20 Feb 2025 · MU, AAML

Fast Proxies for LLM Robustness Evaluation
Tim Beyer, Jan Schuchardt, Leo Schwinn, Stephan Günnemann
14 Feb 2025 · AAML

Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, ..., Zikui Cai, Bilal Chughtai, Y. Gal, Furong Huang, Dylan Hadfield-Menell
03 Feb 2025 · MU, AAML, ELM

Extracting Unlearned Information from LLMs with Activation Steering
Atakan Seyitoğlu, A. Kuvshinov, Leo Schwinn, Stephan Günnemann
04 Nov 2024 · MU, LLMSV

Adversarial Attacks on Large Language Models Using Regularized Relaxation
Samuel Jacob Chacko, Sajib Biswas, Chashi Mahiul Islam, Fatema Tabassum Liza, Xiuwen Liu
24 Oct 2024 · AAML

Bayesian scaling laws for in-context learning
Aryaman Arora, Dan Jurafsky, Christopher Potts, Noah D. Goodman
21 Oct 2024

Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks
Zi Wang, Divyam Anshumaan, Ashish Hooda, Yudong Chen, Somesh Jha
05 Oct 2024 · AAML

A Probabilistic Perspective on Unlearning and Alignment for Large Language Models
Yan Scholten, Stephan Günnemann, Leo Schwinn
04 Oct 2024 · MU

An Adversarial Perspective on Machine Unlearning for AI Safety
Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, F. Tramèr, Javier Rando
26 Sep 2024 · MU, AAML

Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models?
Mohammad Bahrami Karkevandi, Nishant Vishwamitra, Peyman Najafirad
05 Aug 2024 · AAML

Revisiting the Robust Alignment of Circuit Breakers
Leo Schwinn, Simon Geisler
22 Jul 2024 · AAML

Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models
Zihao Xu, Yi Liu, Gelei Deng, Kailong Wang, Yuekang Li, Ling Shi, S. Picek
16 Jul 2024 · KELM

Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Jiayang Song, Yuheng Huang, Zhehua Zhou, Lei Ma
10 Jul 2024

Large-Scale Dataset Pruning in Adversarial Training through Data Importance Extrapolation
Björn Nieth, Thomas Altstidl, Leo Schwinn, Björn Eskofier
19 Jun 2024 · AAML

Improving Alignment and Robustness with Circuit Breakers
Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks
06 Jun 2024 · AAML

Efficient Adversarial Training in LLMs with Continuous Attacks
Sophie Xhonneux, Alessandro Sordoni, Stephan Günnemann, Gauthier Gidel, Leo Schwinn
24 May 2024 · AAML

Rethinking LLM Memorization through the Lens of Adversarial Compression
Avi Schwarzschild, Zhili Feng, Pratyush Maini, Zachary Chase Lipton, J. Zico Kolter
23 Apr 2024

Uncovering Safety Risks of Large Language Models through Concept Activation Vector
Zhihao Xu, Ruixuan Huang, Changyu Chen, Shuai Wang, Xiting Wang
18 Apr 2024 · LLMSV

Threats, Attacks, and Defenses in Machine Unlearning: A Survey
Ziyao Liu, Huanyi Ye, Chen Chen, Yongsen Zheng, K. Lam
20 Mar 2024 · AAML, MU

Defending Against Unforeseen Failure Modes with Latent Adversarial Training
Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell
08 Mar 2024 · AAML

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, ..., Yan Shoshitaishvili, Jimmy Ba, K. Esvelt, Alexandr Wang, Dan Hendrycks
05 Mar 2024 · ELM

Eight Methods to Evaluate Robust Unlearning in LLMs
Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell
26 Feb 2024 · ELM, MU

Who's Harry Potter? Approximate Unlearning in LLMs
Ronen Eldan, M. Russinovich
03 Oct 2023 · MU, MoMe

Ewald-based Long-Range Message Passing for Molecular Graphs
Arthur Kosmala, Johannes Gasteiger, Nicholas Gao, Stephan Günnemann
08 Mar 2023

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, John Kernion, Amanda Askell, Yuntao Bai, ..., Nicholas Joseph, Sam McCandlish, C. Olah, Jared Kaplan, Jack Clark
23 Aug 2022

Improving Robustness against Real-World and Worst-Case Distribution Shifts through Decision Region Quantification
Leo Schwinn, Leon Bungert, A. Nguyen, René Raab, Falk Pulsmeyer, Doina Precup, Björn Eskofier, Dario Zanca
19 May 2022 · OOD