ResearchTrend.AI

Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models

1 April 2024
Yi-Lin Tuan
Xilun Chen
Eric Michael Smith
Louis Martin
Soumya Batra
Asli Celikyilmaz
William Yang Wang
Daniel M. Bikel
arXiv: 2404.01295

Papers citing "Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models"

8 / 8 papers shown
You Know What I'm Saying: Jailbreak Attack via Implicit Reference
Tianyu Wu
Lingrui Mei
Ruibin Yuan
Lujun Li
Wei Xue
Yike Guo
53
1
0
04 Oct 2024
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
Guobin Shen
Dongcheng Zhao
Yiting Dong
Xiang He
Yi Zeng
AAML
52
1
0
03 Oct 2024
A Gradient Analysis Framework for Rewarding Good and Penalizing Bad Examples in Language Models
Yi-Lin Tuan
William Yang Wang
37
1
0
29 Aug 2024
OR-Bench: An Over-Refusal Benchmark for Large Language Models
Justin Cui
Wei-Lin Chiang
Ion Stoica
Cho-Jui Hsieh
ALM
40
35
0
31 May 2024
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
425
12,150
0
04 Mar 2022
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
303
1,620
0
18 Sep 2019
Language Models as Knowledge Bases?
Fabio Petroni
Tim Rocktäschel
Patrick Lewis
A. Bakhtin
Yuxiang Wu
Alexander H. Miller
Sebastian Riedel
KELM
AI4MH
458
2,592
0
03 Sep 2019
A causal framework for explaining the predictions of black-box sequence-to-sequence models
David Alvarez-Melis
Tommi Jaakkola
CML
235
203
0
06 Jul 2017