Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2407.08770
Cited By
Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing
11 July 2024
Huanqian Wang
Yang Yue
Rui Lu
Jingxin Shi
Andrew Zhao
Shenzhi Wang
Shiji Song
Gao Huang
LM&Ro
KELM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing"
10 / 10 papers shown
Title
LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint
Qianli Ma
Dongrui Liu
Qian Chen
Linfeng Zhang
Jing Shao
MoMe
97
0
0
24 Feb 2025
Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection
Le Yang
Ziwei Zheng
Boxu Chen
Zhengyu Zhao
Chenhao Lin
Chao Shen
VLM
138
3
0
18 Dec 2024
Extracting and Transferring Abilities For Building Multi-lingual Ability-enhanced Large Language Models
Zhipeng Chen
Liang Song
K. Zhou
Wayne Xin Zhao
B. Wang
Weipeng Chen
Ji-Rong Wen
60
0
0
10 Oct 2024
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
Zhexin Zhang
Junxiao Yang
Pei Ke
Shiyao Cui
Chujie Zheng
Hongning Wang
Minlie Huang
AAML
MU
37
26
0
03 Jul 2024
Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
Adib Hasan
Ileana Rugina
Alex Wang
AAML
47
22
0
19 Jan 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee
Xiaoyan Bai
Itamar Pres
Martin Wattenberg
Jonathan K. Kummerfeld
Rada Mihalcea
64
95
0
03 Jan 2024
Language Model Alignment with Elastic Reset
Michael Noukhovitch
Samuel Lavoie
Florian Strub
Aaron Courville
KELM
87
25
0
06 Dec 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Samuel Marks
Max Tegmark
HILM
91
167
0
10 Oct 2023
Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese
Nat McAleese
Maja Trkebacz
John Aslanides
Vlad Firoiu
...
John F. J. Mellor
Demis Hassabis
Koray Kavukcuoglu
Lisa Anne Hendricks
G. Irving
ALM
AAML
225
500
0
28 Sep 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
303
11,881
0
04 Mar 2022
1