arXiv:2012.05862
Understanding Learned Reward Functions
10 December 2020
Eric J. Michaud
Adam Gleave
Stuart J. Russell
XAI
OffRL
Papers citing "Understanding Learned Reward Functions"
26 / 26 papers shown
Title
Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning
Kai Ye
Hongyi Zhou
Jin Zhu
Francesco Quinzan
C. Shi
23
1
0
03 Apr 2025
Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?
Xueru Wen
Jie Lou
Y. Lu
Hongyu Lin
Xing Yu
Xinyu Lu
Ben He
Xianpei Han
Debing Zhang
Le Sun
ALM
61
4
0
17 Feb 2025
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
Zhihan Liu
Miao Lu
Shenao Zhang
Boyi Liu
Hongyi Guo
Yingxiang Yang
Jose H. Blanchet
Zhaoran Wang
40
42
0
26 May 2024
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
David Dalrymple
Joar Skalse
Yoshua Bengio
Stuart J. Russell
Max Tegmark
...
Clark Barrett
Ding Zhao
Zhi-Xuan Tan
Jeannette Wing
Joshua Tenenbaum
46
51
0
10 May 2024
Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification
Joar Skalse
Alessandro Abate
28
2
0
11 Mar 2024
InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
Yuchun Miao
Sen Zhang
Liang Ding
Rong Bao
Lefei Zhang
Dacheng Tao
27
12
0
14 Feb 2024
Explaining Learned Reward Functions with Counterfactual Trajectories
Jan Wehner
Frans Oliehoek
Luciano Cavalcante Siebert
26
0
0
07 Feb 2024
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint
Wei Xiong
Hanze Dong
Chen Ye
Ziqi Wang
Han Zhong
Heng Ji
Nan Jiang
Tong Zhang
OffRL
38
156
0
18 Dec 2023
FoMo Rewards: Can we cast foundation models as reward functions?
Ekdeep Singh Lubana
Johann Brehmer
P. D. Haan
Taco S. Cohen
OffRL
LRM
45
2
0
06 Dec 2023
Inverse Decision Modeling: Learning Interpretable Representations of Behavior
Daniel Jarrett
Alihan Huyuk
M. Schaar
AI4CE
15
27
0
28 Oct 2023
Active teacher selection for reinforcement learning from human feedback
Rachel Freedman
Justin Svegliato
K. H. Wray
Stuart J. Russell
31
6
0
23 Oct 2023
An Information Bottleneck Characterization of the Understanding-Workload Tradeoff
Lindsay M. Sanneman
Mycal Tucker
Julie A. Shah
24
2
0
11 Oct 2023
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper
Xander Davies
Claudia Shi
T. Gilbert
Jérémy Scheurer
...
Erdem Biyik
Anca Dragan
David M. Krueger
Dorsa Sadigh
Dylan Hadfield-Menell
ALM
OffRL
44
470
0
27 Jul 2023
Can Differentiable Decision Trees Enable Interpretable Reward Learning from Human Feedback?
Akansha Kalra
Daniel S. Brown
16
0
0
22 Jun 2023
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Alexandre Ramé
Guillaume Couairon
Mustafa Shukor
Corentin Dancette
Jean-Baptiste Gaya
Laure Soulier
Matthieu Cord
MoMe
35
135
0
07 Jun 2023
Learning Interpretable Models of Aircraft Handling Behaviour by Reinforcement Learning from Human Feedback
Tom Bewley
J. Lawry
Arthur G. Richards
27
1
0
26 May 2023
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Hanze Dong
Wei Xiong
Deepanshu Goyal
Yihan Zhang
Winnie Chow
Rui Pan
Shizhe Diao
Jipeng Zhang
Kashun Shum
Tong Zhang
ALM
18
401
0
13 Apr 2023
Learning on the Job: Self-Rewarding Offline-to-Online Finetuning for Industrial Insertion of Novel Connectors from Vision
Ashvin Nair
Brian Zhu
Gokul Narayanan
Eugen Solowjow
Sergey Levine
OffRL
OnRL
25
14
0
27 Oct 2022
Skill-Based Reinforcement Learning with Intrinsic Reward Matching
Ademi Adeniji
Amber Xie
Pieter Abbeel
OffRL
17
5
0
14 Oct 2022
Reward Learning with Trees: Methods and Evaluation
Tom Bewley
J. Lawry
Arthur G. Richards
R. Craddock
Ian Henderson
23
1
0
03 Oct 2022
Calculus on MDPs: Potential Shaping as a Gradient
Erik Jenner
H. V. Hoof
Adam Gleave
17
4
0
20 Aug 2022
Causal Confusion and Reward Misidentification in Preference-Based Reward Learning
J. Tien
Jerry Zhi-Yang He
Zackory M. Erickson
Anca Dragan
Daniel S. Brown
CML
33
39
0
13 Apr 2022
Preprocessing Reward Functions for Interpretability
Erik Jenner
Adam Gleave
11
7
0
25 Mar 2022
Interpretable Preference-based Reinforcement Learning with Tree-Structured Reward Functions
Tom Bewley
Freddy Lecue
OffRL
6
11
0
20 Dec 2021
Maximum Entropy RL (Provably) Solves Some Robust RL Problems
Benjamin Eysenbach
Sergey Levine
OOD
24
174
0
10 Mar 2021
Quantifying Differences in Reward Functions
Adam Gleave
Michael Dennis
Shane Legg
Stuart J. Russell
Jan Leike
OffRL
15
66
0
24 Jun 2020