Preprocessing Reward Functions for Interpretability

25 March 2022

Adam Gleave

Papers citing "Preprocessing Reward Functions for Interpretability"

Title | Authors | Listed counts | Date
Explaining Learned Reward Functions with Counterfactual Trajectories | Jan Wehner, Frans Oliehoek, Luciano Cavalcante Siebert | 29, 0, 0 | 07 Feb 2024
Learning Interpretable Models of Aircraft Handling Behaviour by Reinforcement Learning from Human Feedback | Tom Bewley, J. Lawry, Arthur G. Richards | 30, 1, 0 | 26 May 2023
Reward Learning with Trees: Methods and Evaluation | Tom Bewley, J. Lawry, Arthur G. Richards, R. Craddock, Ian Henderson | 23, 1, 0 | 03 Oct 2022
Calculus on MDPs: Potential Shaping as a Gradient | Erik Jenner, H. V. Hoof, Adam Gleave | 22, 4, 0 | 20 Aug 2022
Fine-Tuning Language Models from Human Preferences | Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving | ALM; 280, 1,595, 0 | 18 Sep 2019