Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2405.14488
Cited By
MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability
23 May 2024
Yanrui Du
Sendong Zhao
Danyang Zhao
Ming Ma
Yuhan Chen
Liangyu Huo
Qing Yang
Dongliang Xu
Bing Qin
Re-assign community
ArXiv
PDF
HTML
Papers citing
"MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability"
9 / 9 papers shown
Title
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
Y. Wang
Tiansheng Huang
Li Shen
H. Yao
Haotian Luo
Rui Liu
Naiqiang Tan
Jiaxing Huang
Dacheng Tao
AAML
MoMe
CLL
109
1
0
30 Jan 2025
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
68
1
0
09 Oct 2024
Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
Zhichen Dong
Zhanhui Zhou
Chao Yang
Jing Shao
Yu Qiao
ELM
49
55
0
14 Feb 2024
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
Zhangchen Xu
Fengqing Jiang
Luyao Niu
Jinyuan Jia
Bill Yuchen Lin
Radha Poovendran
AAML
129
82
0
14 Feb 2024
COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
Xing-ming Guo
Fangxu Yu
Huan Zhang
Lianhui Qin
Bin Hu
AAML
112
69
0
13 Feb 2024
WARM: On the Benefits of Weight Averaged Reward Models
Alexandre Ramé
Nino Vieillard
Léonard Hussenot
Robert Dadashi
Geoffrey Cideron
Olivier Bachem
Johan Ferret
97
92
0
22 Jan 2024
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu
Xingwei Lin
Zheng Yu
Xinyu Xing
SILM
110
292
0
19 Sep 2023
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli
Liane Lovitt
John Kernion
Amanda Askell
Yuntao Bai
...
Nicholas Joseph
Sam McCandlish
C. Olah
Jared Kaplan
Jack Clark
218
441
0
23 Aug 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
1