Defining and Characterizing Reward Hacking

27 September 2022
Joar Skalse
Nikolaus H. R. Howe
Dmitrii Krasheninnikov
David M. Krueger

Papers citing "Defining and Characterizing Reward Hacking"

Showing 10 of 60 citing papers.
ZYN: Zero-Shot Reward Models with Yes-No Questions for RLAIF
Víctor Gallego
SyDa
251
5
0
11 Aug 2023
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper
Xander Davies
Claudia Shi
T. Gilbert
Jérémy Scheurer
...
Erdem Biyik
Anca Dragan
David M. Krueger
Dorsa Sadigh
Dylan Hadfield-Menell
ALM, OffRL
358
712
0
27 Jul 2023
Learning to Generate Better Than Your LLM
Jonathan D. Chang
Kianté Brantley
Rajkumar Ramamurthy
Dipendra Kumar Misra
Wen Sun
272
54
0
20 Jun 2023
Machine Love
Joel Lehman
290
5
0
18 Feb 2023
On The Fragility of Learned Reward Functions
Lev McKinney
Yawen Duan
David M. Krueger
Adam Gleave
175
23
0
09 Jan 2023
Misspecification in Inverse Reinforcement Learning
AAAI Conference on Artificial Intelligence (AAAI), 2022
Joar Skalse
Alessandro Abate
216
28
0
06 Dec 2022
Reward Gaming in Conditional Text Generation
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Richard Yuanzhe Pang
Vishakh Padmakumar
Thibault Sellam
Ankur P. Parikh
He He
370
28
0
16 Nov 2022
Scaling Laws for Reward Model Overoptimization
International Conference on Machine Learning (ICML), 2022
Leo Gao
John Schulman
Jacob Hilton
ALM
376
776
0
19 Oct 2022
The Alignment Problem from a Deep Learning Perspective
International Conference on Learning Representations (ICLR), 2022
Richard Ngo
Lawrence Chan
Sören Mindermann
534
247
0
30 Aug 2022
Provably Safe Reinforcement Learning: Conceptual Analysis, Survey, and Benchmarking
Hanna Krasowski
Jakob Thumm
Marlon Müller
Lukas Schäfer
Xiao Wang
Matthias Althoff
314
39
0
13 May 2022