arXiv:2402.18540
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
20 January 2025
Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora
Tags: ALM

Papers citing "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates" (29 papers):

ICon: In-Context Contribution for Automatic Data Selection
Yixin Yang, Qingxiu Dong, Linli Yao, Fangwei Zhu, Zhifang Sui
08 May 2025

Alleviating the Fear of Losing Alignment in LLM Fine-tuning
Kang Yang, Guanhong Tao, X. Chen, Jun Xu
13 Apr 2025

Representation Bending for Large Language Model Safety
Ashkan Yousefpour, Taeheon Kim, Ryan S. Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Alvin Wan, Harrison Ngan, Youngjae Yu, Jonghyun Choi
Tags: AAML, ALM, KELM
02 Apr 2025

SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
Aladin Djuhera, S. Kadhe, Farhan Ahmed, Syed Zawad, Holger Boche
Tags: MoMe
21 Mar 2025

Single-pass Detection of Jailbreaking Input in Large Language Models
Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios G. Chrysos, V. Cevher
Tags: AAML
24 Feb 2025

GSCE: A Prompt Framework with Enhanced Reasoning for Reliable LLM-driven Drone Control
Wenhao Wang, Yanyan Li, Long Jiao, Jiawei Yuan
18 Feb 2025

Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
Y. Wang, Tiansheng Huang, Li Shen, H. Yao, Haotian Luo, Rui Liu, Naiqiang Tan, Jiaxing Huang, Dacheng Tao
Tags: AAML, MoMe, CLL
30 Jan 2025

A Grounded Observer Framework for Establishing Guardrails for Foundation Models in Socially Sensitive Domains
Rebecca Ramnauth, Dražen Brščić, Brian Scassellati
23 Dec 2024

Chained Tuning Leads to Biased Forgetting
Megan Ung, Alicia Sun, Samuel J. Bell, Bhaktipriya Radharapu, Levent Sagun, Adina Williams
Tags: CLL, KELM
21 Dec 2024

Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation
Guozhi Liu, Weiwei Lin, Tiansheng Huang, Ruichao Mo, Qi Mu, Li Shen
Tags: AAML
13 Oct 2024

Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin
11 Oct 2024

SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection
Han Shen, Pin-Yu Chen, Payel Das, Tianyi Chen
Tags: ALM
09 Oct 2024

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
Tags: AAML
26 Sep 2024

Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang
Tags: PILM, AAML
05 Sep 2024

LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs
Chansung Park, Juyong Jiang, Fan Wang, Sayak Paul, Jing Tang
24 Aug 2024

Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu
Tags: AAML, MoMe
18 Aug 2024

AttackER: Towards Enhancing Cyber-Attack Attribution with a Named Entity Recognition Dataset
Pritam Deka, Sampath Rajapaksha, Ruby Rani, Amirah Almutairi, Erisa Karafili
09 Aug 2024

Know Your Limits: A Survey of Abstention in Large Language Models
Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, Lucy Lu Wang
25 Jul 2024

Decoding-Time Language Model Alignment with Multiple Objectives
Ruizhe Shi, Yifang Chen, Yushi Hu, Alisa Liu, Hannaneh Hajishirzi, Noah A. Smith, Simon Du
27 Jun 2024

How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics
Nidhir Bhavsar, Jonathan Jordan, Sherzod Hakimov, David Schlangen
20 Jun 2024

Is My Data in Your Retrieval Database? Membership Inference Attacks Against Retrieval Augmented Generation
Maya Anderson, Guy Amit, Abigail Goldsteen
Tags: AAML
30 May 2024

Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning
Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
28 May 2024

No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks
Chak Tou Leong, Yi Cheng, Kaishuai Xu, Jian Wang, Hanlin Wang, Wenjie Li
Tags: AAML
25 May 2024

Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation
Yuxi Li, Yi Liu, Yuekang Li, Ling Shi, Gelei Deng, Shengquan Chen, Kailong Wang
20 May 2024

Proof-of-Learning with Incentive Security
Zishuo Zhao, Zhixuan Fang, Xuechao Wang, Xi Chen, Yuan Zhou
Tags: AAML
13 Apr 2024

Vaccine: Perturbation-aware Alignment for Large Language Model
Tiansheng Huang, Sihao Hu, Ling Liu
02 Feb 2024

FireAct: Toward Language Agent Fine-tuning
Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, Shunyu Yao
Tags: ALM, LLMAG
09 Oct 2023

ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge
Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, Steven Jiang, You Zhang
Tags: LM&MA, AI4MH
24 Mar 2023

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
Tags: OSLM, ALM
04 Mar 2022