Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

5 October 2023
Xiangyu Qi
Yi Zeng
Tinghao Xie
Pin-Yu Chen
Ruoxi Jia
Prateek Mittal
Peter Henderson
    SILM
arXiv:2310.03693

Papers citing "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!"

50 / 395 papers shown
FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute
Sotiris Anagnostidis
Gregor Bachmann
Yeongmin Kim
Jonas Kohler
Markos Georgopoulos
A. Sanakoyeu
Yuming Du
Albert Pumarola
Ali K. Thabet
Edgar Schönfeld
76
0
0
27 Feb 2025
Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification
Vishnu Kabir Chhabra
Ding Zhu
Mohammad Mahdi Khalili
37
2
0
27 Feb 2025
Rethinking LLM Unlearning Objectives: A Gradient Perspective and Go Beyond
Qizhou Wang
Jin Peng Zhou
Zhanke Zhou
Saebyeol Shin
Bo Han
Kilian Q. Weinberger
AILaw
ELM
MU
63
3
0
26 Feb 2025
Shh, don't say that! Domain Certification in LLMs
Cornelius Emde
Alasdair Paren
Preetham Arvind
Maxime Kayser
Tom Rainforth
Thomas Lukasiewicz
Bernard Ghanem
Philip H. S. Torr
Adel Bibi
45
1
0
26 Feb 2025
Steered Generation via Gradient Descent on Sparse Features
Sumanta Bhattacharyya
Pedram Rooshenas
LLMSV
43
0
0
25 Feb 2025
AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
Zhexin Zhang
Leqi Lei
Junxiao Yang
Xijie Huang
Yida Lu
...
Xianqi Lei
C. Pan
Lei Sha
H. Wang
Minlie Huang
AAML
43
0
0
24 Feb 2025
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley
Daniel Tan
Niels Warncke
Anna Sztyber-Betley
Xuchan Bao
Martín Soto
Nathan Labenz
Owain Evans
AAML
73
8
0
24 Feb 2025
Single-pass Detection of Jailbreaking Input in Large Language Models
Leyla Naz Candogan
Yongtao Wu
Elias Abad Rocamora
Grigorios G. Chrysos
V. Cevher
AAML
45
0
0
24 Feb 2025
BA-LoRA: Bias-Alleviating Low-Rank Adaptation to Mitigate Catastrophic Inheritance in Large Language Models
Yupeng Chang
Yi-Ju Chang
Yuan Wu
AI4CE
ALM
74
0
0
24 Feb 2025
Beyond Release: Access Considerations for Generative AI Systems
Irene Solaiman
Rishi Bommasani
Dan Hendrycks
Ariel Herbert-Voss
Yacine Jernite
Aviya Skowron
Andrew Trask
58
1
0
23 Feb 2025
A generative approach to LLM harmfulness detection with special red flag tokens
Sophie Xhonneux
David Dobre
Mehrnaz Mohfakhami
Leo Schwinn
Gauthier Gidel
45
1
0
22 Feb 2025
Merger-as-a-Stealer: Stealing Targeted PII from Aligned LLMs with Model Merging
Lin Lu
Zhigang Zuo
Ziji Sheng
Pan Zhou
MoMe
48
0
0
22 Feb 2025
Making Them a Malicious Database: Exploiting Query Code to Jailbreak Aligned Large Language Models
Qingsong Zou
Jingyu Xiao
Qing Li
Zhi Yan
Y. Wang
Li Xu
Wenxuan Wang
Kuofeng Gao
Ruoyu Li
Yong-jia Jiang
AAML
91
0
0
21 Feb 2025
Position: Standard Benchmarks Fail -- LLM Agents Present Overlooked Risks for Financial Applications
Zichen Chen
Jiaao Chen
Jianda Chen
Misha Sra
ELM
34
1
0
21 Feb 2025
UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models
Huawei Lin
Yingjie Lao
Tong Geng
Tan Yu
Weijie Zhao
AAML
SILM
79
2
0
18 Feb 2025
Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models
Yingshui Tan
Yilei Jiang
Y. Li
J. Liu
Xingyuan Bu
Wenbo Su
Xiangyu Yue
Xiaoyong Zhu
Bo Zheng
ALM
75
0
0
17 Feb 2025
RIDE: Enhancing Large Language Model Alignment through Restyled In-Context Learning Demonstration Exemplars
Yuncheng Hua
Lizhen Qu
Zhuang Li
Hao Xue
Flora D. Salim
Gholamreza Haffari
ALM
130
0
0
17 Feb 2025
Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment
Somnath Banerjee
Sayan Layek
Pratyush Chatterjee
Animesh Mukherjee
Rima Hazra
LLMSV
71
0
0
16 Feb 2025
Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences
Shanshan Han
Salman Avestimehr
Chaoyang He
71
0
0
12 Feb 2025
LUNAR: LLM Unlearning via Neural Activation Redirection
William F. Shen
Xinchi Qiu
Meghdad Kurmanji
Alex Iacob
Lorenzo Sani
Yihong Chen
Nicola Cancedda
Nicholas D. Lane
MU
49
1
0
11 Feb 2025
Trustworthy AI on Safety, Bias, and Privacy: A Survey
Xingli Fang
Jianwei Li
Varun Mulchandani
Jung-Eun Kim
37
0
0
11 Feb 2025
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation
Maria Eriksson
Erasmo Purificato
Arman Noroozian
Joao Vinagre
Guillaume Chaslot
Emilia Gomez
David Fernandez Llorca
ELM
122
1
0
10 Feb 2025
OntoTune: Ontology-Driven Self-training for Aligning Large Language Models
Zhiqiang Liu
Chengtao Gan
Junjie Wang
Y. Zhang
Zhongpu Bo
Mengshu Sun
H. Chen
Wen Zhang
65
0
0
08 Feb 2025
Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions
Jingxin Xu
Guoshun Nan
Sheng Guan
Sicong Leng
Y. Liu
Zixiao Wang
Yuyang Ma
Zhili Zhou
Yanzhao Hou
Xiaofeng Tao
LM&MA
53
0
0
08 Feb 2025
Adversarial ML Problems Are Getting Harder to Solve and to Evaluate
Javier Rando
Jie Zhang
Nicholas Carlini
F. Tramèr
AAML
ELM
52
3
0
04 Feb 2025
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che
Stephen Casper
Robert Kirk
Anirudh Satheesh
Stewart Slocum
...
Zikui Cai
Bilal Chughtai
Y. Gal
Furong Huang
Dylan Hadfield-Menell
MU
AAML
ELM
80
2
0
03 Feb 2025
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
Y. Wang
Tiansheng Huang
Li Shen
H. Yao
Haotian Luo
Rui Liu
Naiqiang Tan
Jiaxing Huang
Dacheng Tao
AAML
MoMe
CLL
109
1
0
30 Jan 2025
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
Kaifeng Lyu
Haoyu Zhao
Xinran Gu
Dingli Yu
Anirudh Goyal
Sanjeev Arora
ALM
75
41
0
20 Jan 2025
Scopes of Alignment
Kush R. Varshney
Zahra Ashktorab
Djallel Bouneffouf
Matthew D Riemer
Justin D. Weisz
34
0
0
15 Jan 2025
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
Fengqing Jiang
Zhangchen Xu
Luyao Niu
Bill Yuchen Lin
Radha Poovendran
SILM
66
5
0
08 Jan 2025
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense
Yang Ouyang
Hengrui Gu
Shuhang Lin
Wenyue Hua
Jie Peng
B. Kailkhura
Tianlong Chen
Kaixiong Zhou
AAML
31
1
0
05 Jan 2025
LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models
Miao Yu
Junfeng Fang
Yingjie Zhou
Xing Fan
Kun Wang
Shirui Pan
Qingsong Wen
AAML
56
0
0
03 Jan 2025
Security Attacks on LLM-based Code Completion Tools
Wen Cheng
Ke Sun
Xinyu Zhang
Wei Wang
SILM
AAML
ELM
48
0
0
03 Jan 2025
ArguMentor: Augmenting User Experiences with Counter-Perspectives
Priya Pitre
Kurt Luther
LLMAG
31
0
0
03 Jan 2025
Cut the Deadwood Out: Post-Training Model Purification with Selective Module Substitution
Yao Tong
Weijun Li
Xuanli He
Haolan Zhan
Qiongkai Xu
AAML
25
1
0
31 Dec 2024
Enhancing AI Safety Through the Fusion of Low Rank Adapters
Satya Swaroop Gudipudi
Sreeram Vipparla
Harpreet Singh
Shashwat Goel
Ponnurangam Kumaraguru
MoMe
AAML
44
2
0
30 Dec 2024
Retention Score: Quantifying Jailbreak Risks for Vision Language Models
Zaitang Li
Pin-Yu Chen
Tsung-Yi Ho
AAML
28
0
0
23 Dec 2024
Chained Tuning Leads to Biased Forgetting
Megan Ung
Alicia Sun
Samuel J. Bell
Bhaktipriya Radharapu
Levent Sagun
Adina Williams
CLL
KELM
84
0
0
21 Dec 2024
The Evolution of LLM Adoption in Industry Data Curation Practices
Crystal Qian
Michael Xieyang Liu
Emily Reif
Grady Simon
Nada Hussein
Nathan Clement
James Wexler
Carrie J. Cai
Michael Terry
Minsuk Kahng
AILaw
ELM
70
4
0
20 Dec 2024
Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context
Nilanjana Das
Edward Raff
Manas Gaur
AAML
101
1
0
20 Dec 2024
Quantized Delta Weight Is Safety Keeper
Yule Liu
Zhen Sun
Xinlei He
Xinyi Huang
80
2
0
29 Nov 2024
PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning
Shenghui Li
Edith C. H. Ngai
Fanghua Ye
Thiemo Voigt
SILM
83
6
0
28 Nov 2024
Neutralizing Backdoors through Information Conflicts for Large Language Models
Chen Chen
Yuchen Sun
Xueluan Gong
Jiaxin Gao
K. Lam
KELM
AAML
67
0
0
27 Nov 2024
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models
Shuyang Hao
Bryan Hooi
J. Liu
Kai-Wei Chang
Zi Huang
Yujun Cai
AAML
87
0
0
27 Nov 2024
Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness
Avinash Amballa
Durga Sandeep Saluru
Gayathri Akkinapalli
Abhishek Sureddy
Akshay Kumar Sureddy
ALM
78
0
0
26 Nov 2024
The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models
Xikang Yang
Xuehai Tang
Jizhong Han
Songlin Hu
68
0
0
18 Nov 2024
CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization
Nay Myat Min
Long H. Pham
Yige Li
Jun Sun
AAML
64
3
0
18 Nov 2024
ToxiLab: How Well Do Open-Source LLMs Generate Synthetic Toxicity Data?
Zheng Hui
Zhaoxiao Guo
Hang Zhao
Juanyong Duan
Lin Ai
Yinheng Li
Julia Hirschberg
Congrui Huang
70
1
0
18 Nov 2024
Defining and Evaluating Physical Safety for Large Language Models
Yung-Chen Tang
Pin-Yu Chen
Tsung-Yi Ho
ELM
32
2
0
04 Nov 2024
SQL Injection Jailbreak: A Structural Disaster of Large Language Models
Jiawei Zhao
Kejiang Chen
W. Zhang
Nenghai Yu
AAML
38
0
0
03 Nov 2024