Reward Gaming in Conditional Text Generation
Richard Yuanzhe Pang, Vishakh Padmakumar, Thibault Sellam, Ankur P. Parikh, He He
16 November 2022 · arXiv:2211.08714

Papers citing "Reward Gaming in Conditional Text Generation" (24 of 24 papers shown)

On the Robustness of Reward Models for Language Model Alignment
Jiwoo Hong, Noah Lee, Eunki Kim, Guijin Son, Woojin Chung, Aman Gupta, Shao Tang, James Thorne
12 May 2025

CHARM: Calibrating Reward Models With Chatbot Arena Scores
Xiao Zhu, Chenmien Tan, Pinzhen Chen, Rico Sennrich, Yanlin Zhang, Hanxu Hu
14 Apr 2025 · ALM

QE-EBM: Using Quality Estimators as Energy Loss for Machine Translation
Gahyun Yoo, Jay Yoon Lee
14 Oct 2024

The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models
Yanjun Chen, Dawei Zhu, Yirong Sun, Xinghao Chen, Wei Zhang, Xiaoyu Shen
09 Oct 2024 · ALM

Post-hoc Reward Calibration: A Case Study on Length Bias
Zeyu Huang, Zihan Qiu, Zili Wang, Edoardo M. Ponti, Ivan Titov
25 Sep 2024

On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization
Yong Lin, Skyler Seto, Maartje ter Hoeve, Katherine Metcalf, B. Theobald, Xuan Wang, Yizhe Zhang, Chen Huang, Tong Zhang
05 Sep 2024

Can a Bayesian Oracle Prevent Harm from an Agent?
Yoshua Bengio, Michael K. Cohen, Nikolay Malkin, Matt MacDermott, Damiano Fornasiere, Pietro Greiner, Younesse Kaddar
09 Aug 2024

Margin-aware Preference Optimization for Aligning Diffusion Models without Reference
Jiwoo Hong, Sayak Paul, Noah Lee, Kashif Rasul, James Thorne, Jongheon Jeong
10 Jun 2024

Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation
Adam Fisch, Joshua Maynez, R. A. Hofer, Bhuwan Dhingra, Amir Globerson, William W. Cohen
06 Jun 2024

Offline Regularised Reinforcement Learning for Large Language Models Alignment
Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, M. G. Azar, ..., Gil Shamir, Rishabh Joshi, Tianqi Liu, Rémi Munos, Bilal Piot
29 May 2024 · OffRL

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart J. Russell, Max Tegmark, ..., Clark Barrett, Ding Zhao, Zhi-Xuan Tan, Jeannette Wing, Joshua Tenenbaum
10 May 2024

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs
Shreyas Chaudhari, Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, A. Kalyan, Karthik Narasimhan, A. Deshpande, Bruno Castro da Silva
12 Apr 2024

Human Alignment of Large Language Models through Online Preference Optimisation
Daniele Calandriello, Daniel Guo, Rémi Munos, Mark Rowland, Yunhao Tang, ..., Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot
13 Mar 2024

Rethinking the Role of Proxy Rewards in Language Model Alignment
Sungdong Kim, Minjoon Seo
02 Feb 2024 · SyDa, ALM

Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, ..., Katherine Heller, Stephen R. Pfohl, Deepak Ramachandran, Peter Shaw, Jonathan Berant
14 Dec 2023

ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, Zhimin Luo
16 Oct 2023

Goodhart's Law in Reinforcement Learning
Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, Joar Skalse
13 Oct 2023

A Long Way to Go: Investigating Length Correlations in RLHF
Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett
05 Oct 2023

Leveraging Implicit Feedback from Deployment Data in Dialogue
Richard Yuanzhe Pang, Stephen Roller, Kyunghyun Cho, He He, Jason Weston
26 Jul 2023

Preference-grounded Token-level Guidance for Language Model Fine-tuning
Shentao Yang, Shujian Zhang, Congying Xia, Yihao Feng, Caiming Xiong, Mi Zhou
01 Jun 2023

Extrapolative Controlled Sequence Generation via Iterative Refinement
Vishakh Padmakumar, Richard Yuanzhe Pang, He He, Ankur P. Parikh
08 Mar 2023

Defining and Characterizing Reward Hacking
Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, David M. Krueger
27 Sep 2022

Relating Neural Text Degeneration to Exposure Bias
Ting-Rui Chiang, Yun-Nung Chen
17 Sep 2021

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu, M. Schuster, Z. Chen, Quoc V. Le, Mohammad Norouzi, ..., Alex Rudnick, Oriol Vinyals, G. Corrado, Macduff Hughes, J. Dean
26 Sep 2016 · AIMat