Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2404.01295
Cited By
Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models
1 April 2024
Yi-Lin Tuan
Xilun Chen
Eric Michael Smith
Louis Martin
Soumya Batra
Asli Celikyilmaz
William Yang Wang
Daniel M. Bikel
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models"
8 / 8 papers shown
Title
You Know What I'm Saying: Jailbreak Attack via Implicit Reference
Tianyu Wu
Lingrui Mei
Ruibin Yuan
Lujun Li
Wei Xue
Yike Guo
53
1
0
04 Oct 2024
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
Guobin Shen
Dongcheng Zhao
Yiting Dong
Xiang He
Yi Zeng
AAML
52
1
0
03 Oct 2024
A Gradient Analysis Framework for Rewarding Good and Penalizing Bad Examples in Language Models
Yi-Lin Tuan
William Yang Wang
37
1
0
29 Aug 2024
OR-Bench: An Over-Refusal Benchmark for Large Language Models
Justin Cui
Wei-Lin Chiang
Ion Stoica
Cho-Jui Hsieh
ALM
40
35
0
31 May 2024
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
425
12,150
0
04 Mar 2022
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
303
1,620
0
18 Sep 2019
Language Models as Knowledge Bases?
Fabio Petroni
Tim Rocktaschel
Patrick Lewis
A. Bakhtin
Yuxiang Wu
Alexander H. Miller
Sebastian Riedel
KELM
AI4MH
458
2,592
0
03 Sep 2019
A causal framework for explaining the predictions of black-box sequence-to-sequence models
David Alvarez-Melis
Tommi Jaakkola
CML
235
203
0
06 Jul 2017
1