Tamper-Resistant Safeguards for Open-Weight LLMs

International Conference on Learning Representations (ICLR), 2025
1 August 2024
Rishub Tamirisa
Bhrugu Bharathi
Long Phan
Andy Zhou
Alice Gatti
Tarun Suresh
Maxwell Lin
Justin Wang
Rowan Wang
Ron Arel
Andy Zou
Dawn Song
Bo Li
Dan Hendrycks
Mantas Mazeika
AAML, MU

Papers citing "Tamper-Resistant Safeguards for Open-Weight LLMs"

50 / 114 papers shown
CacheTrap: Injecting Trojans in LLMs without Leaving any Traces in Inputs or Weights
Mohaiminul Al Nahian
Abeer Matar A. Almalky
Gamana Aragonda
Ranyang Zhou
Sabbir Ahmed
Dmitry Ponomarev
Li Yang
Shaahin Angizi
Adnan Siraj Rakin
40
0
0
27 Nov 2025
Open-weight genome language model safeguards: Assessing robustness via adversarial fine-tuning
James R. M. Black
Moritz S. Hanke
Aaron Maiwald
Tina Hernandez-Boussard
Oliver M. Crook
Jaspreet Pannu
65
0
0
24 Nov 2025
Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education
Xin Yi
Yue Li
Dongsheng Shi
Linlin Wang
Xiaoling Wang
Liang He
AAML
211
0
0
18 Nov 2025
REMIND: Input Loss Landscapes Reveal Residual Memorization in Post-Unlearning LLMs
Liran Cohen
Yaniv Nemcovesky
Avi Mendelson
MU, AAML, CLL, KELM
279
0
0
06 Nov 2025
Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler
Zixuan Hu
Li Shen
Zhenyi Wang
Yongxian Wei
Dacheng Tao
AAML
147
0
0
31 Oct 2025
A Survey on Unlearning in Large Language Models
Ruichen Qiu
Jiajun Tan
Jiayue Pu
Honglin Wang
Xiao-Shan Gao
Fei Sun
MU, AILaw, PILM
606
0
0
29 Oct 2025
Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training
Zheng-Xin Yong
Stephen H. Bach
LRM
224
0
0
23 Oct 2025
A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space
Bingjie Zhang
Yibo Yang
Renzhe
Dandan Guo
Jindong Gu
Philip Torr
Bernard Ghanem
255
0
0
16 Oct 2025
Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning
Guozhi Liu
Qi Mu
Tiansheng Huang
Xinhua Wang
Li Shen
Weiwei Lin
Zhang Li
116
1
0
11 Oct 2025
LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics
Chongyu Fan
Changsheng Wang
Yancheng Huang
Soumyadeep Pal
Sijia Liu
MU, ELM
176
0
0
08 Oct 2025
Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach
X. Li
Y. Wang
Bo Li
AAML
209
0
0
01 Oct 2025
Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning
Yicheng Lang
Yihua Zhang
Chongyu Fan
Changsheng Wang
Jinghan Jia
Sijia Liu
MU
345
0
0
01 Oct 2025
Understanding the Dilemma of Unlearning for Large Language Models
Qingjie Zhang
Haoting Qian
Zhicong Huang
Cheng Hong
Shiyu Huang
Ke Xu
Chao Zhang
Han Qiu
MU
232
1
0
29 Sep 2025
Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction
Yuanbo Xie
Yingjie Zhang
Tianyun Liu
Duohe Ma
Tingwen Liu
AAML
115
1
0
18 Sep 2025
Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning
Filip Sondej
Yushi Yang
MU
346
0
0
15 Sep 2025
Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning
Weitao Feng
Lixu Wang
Tianyi Wei
Jie Zhang
Chongyang Gao
Sinong Zhan
Peizhuo Lv
Wei Dong
AAML, OffRL, CLL
72
0
0
28 Aug 2025
Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks
Jack Youstra
Mohammed Mahfoud
Yang Yan
Henry Sleight
Ethan Perez
Mrinank Sharma
AAML
144
2
0
23 Aug 2025
Gradient Surgery for Safe LLM Fine-Tuning
Biao Yi
Jiahao Li
Baolei Zhang
Lihai Nie
Tong Li
Tiansheng Huang
Zheli Liu
102
1
0
10 Aug 2025
Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks
Bing Han
Feifei Zhao
Dongcheng Zhao
Guobin Shen
Ping Wu
Yu Shi
Yi Zeng
176
0
0
08 Aug 2025
LLM Unlearning Without an Expert Curated Dataset
Xiaoyuan Zhu
Muru Zhang
Ollie Liu
Robin Jia
Willie Neiswanger
MU
247
0
0
08 Aug 2025
Estimating Worst-Case Frontier Risks of Open-Weight LLMs
Eric Wallace
Olivia Watkins
Miles Wang
Kai Chen
Chris Koch
154
7
0
05 Aug 2025
SDD: Self-Degraded Defense against Malicious Fine-tuning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
ZiXuan Chen
Weikai Lu
Xin Lin
Ziqian Zeng
AAML
139
0
0
27 Jul 2025
A Survey on Generative Model Unlearning: Fundamentals, Taxonomy, Evaluation, and Future Direction
Xiaohua Feng
Jiaming Zhang
Fengyuan Yu
C. Wang
Li Zhang
Kaixiang Li
Yuyuan Li
Chaochao Chen
Jianwei Yin
MU
246
2
0
26 Jul 2025
Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility
Brendan Murphy
Dillon Bowen
Shahrad Mohammadzadeh
Tom Tseng
Julius Broomfield
Adam Gleave
Kellin Pelrine
258
2
0
15 Jul 2025
Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models
International Conference on Learning Representations (ICLR), 2025
Biao Yi
Tiansheng Huang
Sishuo Chen
Tong Li
Zheli Liu
Zhixuan Chu
Yiming Li
AAML
227
18
0
19 Jun 2025
FORTRESS: Frontier Risk Evaluation for National Security and Public Safety
Christina Q. Knight
Kaustubh Deshpande
Ved Sirdeshmukh
Meher Mankikar
Scale Red Team
SEAL Research Team
Julian Michael
AAML, ELM
291
2
0
17 Jun 2025
Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization
Filip Sondej
Yushi Yang
Mikołaj Kniejski
Marcel Windys
MU
341
2
0
14 Jun 2025
Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods
Yeonwoo Jang
Shariqah Hossain
Ashwin Sreevatsa
Diogo Cruz
AAML, MU
216
2
0
11 Jun 2025
AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin
Shuo Yang
Qihui Zhang
Yuyang Liu
Yue Huang
Xiaojun Jia
...
Jiayu Yao
Jigang Wang
Hailiang Dai
Yibing Song
Li Yuan
202
8
0
10 Jun 2025
Distillation Robustifies Unlearning
Bruce W. Lee
Addie Foote
Alex Infanger
Leni Shor
Harish Kamath
Jacob Goldman-Wetzler
Bryce Woodworth
Alex Cloud
Alexander Matt Turner
MU
381
4
0
06 Jun 2025
Benchmarking Misuse Mitigation Against Covert Adversaries
Davis Brown
Mahdi Sabbaghi
Luze Sun
Avi Schwarzschild
George Pappas
Eric Wong
Hamed Hassani
132
2
0
06 Jun 2025
Invariance Makes LLM Unlearning Resilient Even to Unanticipated Downstream Fine-Tuning
Changsheng Wang
Yihua Zhang
Jinghan Jia
Parikshit Ram
Dennis L. Wei
Yuguang Yao
Soumyadeep Pal
Nathalie Baracaldo
Sijia Liu
MU
250
4
0
02 Jun 2025
Existing Large Language Model Unlearning Evaluations Are Inconclusive
Zhili Feng
Yixuan Even Xu
Avi Schwarzschild
Robert Kirk
Xander Davies
Yarin Gal
Avi Schwarzschild
J. Zico Kolter
MU, ELM
149
5
0
31 May 2025
Shape it Up! Restoring LLM Safety during Finetuning
ShengYun Peng
Pin-Yu Chen
Jianfeng Chi
Seongmin Lee
Duen Horng Chau
LLMAG
284
3
0
22 May 2025
CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning
Biao Yi
Tiansheng Huang
Baolei Zhang
Tong Li
Lihai Nie
Zheli Liu
Li Shen
MU, AAML
287
5
0
22 May 2025
Watch your steps: Dormant Adversarial Behaviors that Activate upon LLM Finetuning
Thibaud Gloaguen
Mark Vero
Robin Staab
Martin Vechev
AAML
425
0
0
22 May 2025
Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization
Chengcan Wu
Zhixin Zhang
Zeming Wei
Yihao Zhang
Meng Sun
AAML
216
8
0
22 May 2025
Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study
Kaustubh Ponkshe
Shaan Shah
Raghav Singhal
Praneeth Vepakomma
334
0
0
20 May 2025
Security practices in AI development
AI & Society (AS), 2025
Petr Spelda
Vit Stritecky
206
1
0
17 May 2025
Ready2Unlearn: A Learning-Time Approach for Preparing Models with Future Unlearning Readiness
Hanyu Duan
Yi Yang
Ahmed Abbasi
Kar Yan Tam
MU, OnRL
245
1
0
16 May 2025
Layered Unlearning for Adversarial Relearning
Timothy Qian
Vinith Suriyakumar
Ashia Wilson
Dylan Hadfield-Menell
MU
310
1
0
14 May 2025
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
Zihan Guan
Mengxuan Hu
Ronghang Zhu
Sheng Li
Anil Vullikanti
AAML
303
10
0
11 May 2025
Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization
Wenjun Cao
AAML
215
2
0
07 May 2025
SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
Aladin Djuhera
S. Kadhe
Praneet Adusumilli
Syed Zawad
Holger Boche
MoMe
237
14
0
21 Mar 2025
Improving LLM Safety Alignment with Dual-Objective Optimization
Xuandong Zhao
Will Cai
Tianneng Shi
David Huang
Licong Lin
Song Mei
Kurt Thomas
AAML, MU
453
14
0
05 Mar 2025
Beyond Release: Access Considerations for Generative AI Systems
Irene Solaiman
Rishi Bommasani
Dan Hendrycks
Ariel Herbert-Voss
Yacine Jernite
Aviya Skowron
Andrew Trask
504
3
0
23 Feb 2025
A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens
Sophie Xhonneux
David Dobre
Mehrnaz Mohfakhami
Leo Schwinn
Gauthier Gidel
545
2
0
22 Feb 2025
Adversarial ML Problems Are Getting Harder to Solve and to Evaluate
Javier Rando
Jie Zhang
Nicholas Carlini
F. Tramèr
AAML, ELM
336
18
0
04 Feb 2025
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che
Stephen Casper
Robert Kirk
Anirudh Satheesh
Stewart Slocum
...
Zikui Cai
Bilal Chughtai
Y. Gal
Furong Huang
Dylan Hadfield-Menell
MU, AAML, ELM
567
23
0
03 Feb 2025
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
Yun Wang
Tiansheng Huang
Li Shen
Huanjin Yao
Haotian Luo
Rui Liu
Naiqiang Tan
Jiaxing Huang
Dacheng Tao
AAML, MoMe, CLL
387
10
0
30 Jan 2025