ResearchTrend.AI

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
arXiv: 2201.03544

10 January 2022
Alexander Pan
Kush S. Bhatia
Jacob Steinhardt

Papers citing "The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models"

28 / 128 papers shown
Title
Survival Instinct in Offline Reinforcement Learning
Anqi Li
Dipendra Kumar Misra
Andrey Kolobov
Ching-An Cheng
OffRL
10
15
0
05 Jun 2023
PAGAR: Taming Reward Misalignment in Inverse Reinforcement Learning-Based Imitation Learning with Protagonist Antagonist Guided Adversarial Reward
Weichao Zhou
Wenchao Li
16
0
0
02 Jun 2023
Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback
Paul Roit
Johan Ferret
Lior Shani
Roee Aharoni
Geoffrey Cideron
...
Olivier Bachem
G. Elidan
Avinatan Hassidim
Olivier Pietquin
Idan Szpektor
HILM
15
74
0
31 May 2023
Training Socially Aligned Language Models on Simulated Social Interactions
Ruibo Liu
Ruixin Yang
Chenyan Jia
Ge Zhang
Denny Zhou
Andrew M. Dai
Diyi Yang
Soroush Vosoughi
ALM
18
43
0
26 May 2023
Fundamental Limitations of Alignment in Large Language Models
Yotam Wolf
Noam Wies
Oshri Avnery
Yoav Levine
Amnon Shashua
ALM
6
137
0
19 Apr 2023
Positive AI: Key Challenges in Designing Artificial Intelligence for Wellbeing
Willem van der Maden
Derek Lomas
Malak Sadek
P. Hekkert
19
1
0
12 Apr 2023
Natural Selection Favors AIs over Humans
Dan Hendrycks
13
31
0
28 Mar 2023
Reward Design with Language Models
Minae Kwon
Sang Michael Xie
Kalesha Bullard
Dorsa Sadigh
LM&Ro
8
197
0
27 Feb 2023
Progress measures for grokking via mechanistic interpretability
Neel Nanda
Lawrence Chan
Tom Lieberum
Jess Smith
Jacob Steinhardt
26
378
0
12 Jan 2023
On The Fragility of Learned Reward Functions
Lev McKinney
Yawen Duan
David M. Krueger
Adam Gleave
15
19
0
09 Jan 2023
Few-Shot Preference Learning for Human-in-the-Loop RL
Joey Hejna
Dorsa Sadigh
OffRL
13
88
0
06 Dec 2022
Misspecification in Inverse Reinforcement Learning
Joar Skalse
Alessandro Abate
20
21
0
06 Dec 2022
Reward Gaming in Conditional Text Generation
Richard Yuanzhe Pang
Vishakh Padmakumar
Thibault Sellam
Ankur P. Parikh
He He
21
24
0
16 Nov 2022
Policy Optimization with Advantage Regularization for Long-Term Fairness in Decision Systems
Eric Yang Yu
Zhizhen Qin
Min Kyung Lee
Sicun Gao
OffRL
22
9
0
22 Oct 2022
Redefining Counterfactual Explanations for Reinforcement Learning: Overview, Challenges and Opportunities
Jasmina Gajcin
Ivana Dusparic
CML
OffRL
12
7
0
21 Oct 2022
Scaling Laws for Reward Model Overoptimization
Leo Gao
John Schulman
Jacob Hilton
ALM
17
463
0
19 Oct 2022
Reward Learning with Trees: Methods and Evaluation
Tom Bewley
J. Lawry
Arthur G. Richards
R. Craddock
Ian Henderson
18
1
0
03 Oct 2022
Defining and Characterizing Reward Hacking
Joar Skalse
Nikolaus H. R. Howe
Dmitrii Krasheninnikov
David M. Krueger
57
53
0
27 Sep 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova DasSarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
240
453
0
24 Sep 2022
The Alignment Problem from a Deep Learning Perspective
Richard Ngo
Lawrence Chan
Sören Mindermann
29
180
0
30 Aug 2022
RL with KL penalties is better viewed as Bayesian inference
Tomasz Korbak
Ethan Perez
Christopher L. Buckley
OffRL
22
70
0
23 May 2022
Causal Confusion and Reward Misidentification in Preference-Based Reward Learning
J. Tien
Jerry Zhi-Yang He
Zackory M. Erickson
Anca Dragan
Daniel S. Brown
CML
20
39
0
13 Apr 2022
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai
Andy Jones
Kamal Ndousse
Amanda Askell
Anna Chen
...
Jack Clark
Sam McCandlish
C. Olah
Benjamin Mann
Jared Kaplan
11
2,281
0
12 Apr 2022
Unsolved Problems in ML Safety
Dan Hendrycks
Nicholas Carlini
John Schulman
Jacob Steinhardt
164
268
0
28 Sep 2021
Goal Misgeneralization in Deep Reinforcement Learning
L. Langosco
Jack Koch
Lee D. Sharkey
J. Pfau
Laurent Orseau
David M. Krueger
17
77
0
28 May 2021
Reward (Mis)design for Autonomous Driving
W. B. Knox
A. Allievi
Holger Banzhaf
Felix Schmitt
Peter Stone
67
112
0
28 Apr 2021
Reinforcement Learning for Optimization of COVID-19 Mitigation policies
Varun Kompella
Roberto Capobianco
Stacy Jong
Jonathan Browne
S. Fox
L. Meyers
Peter R. Wurman
Peter Stone
62
46
0
20 Oct 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
223
4,424
0
23 Jan 2020