2301.03652
On The Fragility of Learned Reward Functions
9 January 2023
Lev McKinney
Yawen Duan
David M. Krueger
Adam Gleave
Papers citing
"On The Fragility of Learned Reward Functions"
5 / 5 papers shown
Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
Wei Shen
Guanlin Liu
Zheng Wu
Ruofei Zhu
Qingping Yang
Chao Xin
Yu Yue
Lin Yan
82
8
0
28 Mar 2025
Generalizing Reward Modeling for Out-of-Distribution Preference Learning
Chen Jia
25
2
0
22 Feb 2024
Defining and Characterizing Reward Hacking
Joar Skalse
Nikolaus H. R. Howe
Dmitrii Krasheninnikov
David M. Krueger
57
53
0
27 Sep 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
303
11,730
0
04 Mar 2022
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
275
1,561
0
18 Sep 2019