Defining and Characterizing Reward Hacking
arXiv:2209.13085 (v2, latest). 27 September 2022.
Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, David M. Krueger
Links: arXiv (abs) | PDF | HTML
Papers citing "Defining and Characterizing Reward Hacking" (50 of 59 papers shown):

CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
X. Hou, Shaoyuan Xu, Manan Biyani, Mayan Li, Jia-Wei Liu, Todd C. Hollon, Bryan Wang. 24 Nov 2025.

Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning
Pei Yang, Ke Zhang, Ji Wang, Xiao Chen, Yuxin Tang, Eric Yang, Lynn Ai, Bill Shi. 20 Nov 2025. [LRM]

Instrumental goals in advanced AI systems: Features to be managed and not failures to be eliminated?
Willem Fourie. 29 Oct 2025.

Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only
Qingru Zhang, Liang Qiu, Ilgee Hong, Zhenghao Xu, Tianyi Liu, ..., Bing Yin, Chao Zhang, Jianshu Chen, Haoming Jiang, T. Zhao. 24 Oct 2025.

Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking
Stephane Hatgis-Kessell, Logan Mondal Bhamidipaty, Emma Brunskill. 14 Oct 2025.

PoU: Proof-of-Use to Counter Tool-Call Hacking in DeepResearch Agents
Shengjie Ma, Chenlong Deng, Jiaxin Mao, J. Huang, Teng Wang, Junjie Wu, Changwang Zhang, Jun Wang. 13 Oct 2025.

Understanding Sampler Stochasticity in Training Diffusion Models for RLHF
Jiayuan Sheng, Hanyang Zhao, Haoxian Chen, David Yao, Wenpin Tang. 12 Oct 2025.

TaoSR-AGRL: Adaptive Guided Reinforcement Learning Framework for E-commerce Search Relevance
Jianhui Yang, Yiming Jin, Pengkun Jiao, Chenhe Dong, Zerui Huang, Shaowei Yao, Xiaojiang Zhou, Dan Ou, Haihong Tang. 09 Oct 2025. [LRM]

Vul-R2: A Reasoning LLM for Automated Vulnerability Repair
Xin-Cheng Wen, Zirui Lin, Yijun Yang, Cuiyun Gao, Deheng Ye. 07 Oct 2025. [LRM]

Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey
Qiyuan Liu, Hao Xu, Xuhong Chen, Wei Chen, Yee Whye Teh, Ning Miao. 02 Oct 2025. [ReLM, LRM, AI4CE]

Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He. 01 Oct 2025. [LRM]

Learn to Guide Your Diffusion Model
Alexandre Galashov, Ashwini Pokle, Arnaud Doucet, Arthur Gretton, Mauricio Delbracio, Valentin De Bortoli. 01 Oct 2025. [DiffM]

Self-Exploring Language Models for Explainable Link Forecasting on Temporal Graphs via Reinforcement Learning
Zifeng Ding, Shenyang Huang, Zeyu Cao, Emma Kondrup, Zachary Yang, ..., Xianglong Hu, Yuan He, Farimah Poursafaei, Michael M. Bronstein, Andreas Vlachos. 31 Aug 2025. [AI4TS, ReLM, LRM]

PentestJudge: Judging Agent Behavior Against Operational Requirements
Shane Caldwell, Max Harley, Michael Kouremetis, Vincent Abruzzo, Will Pearce. 04 Aug 2025. [LLMAG, ELM]

Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement
Víctor Gallego. 24 Jul 2025. [LRM]

Ground-Compose-Reinforce: Grounding Language in Agentic Behaviours using Limited Data
Andrew C. Li, Toryn Q. Klassen, Andrew Wang, P. A. Alamdari, Sheila A. McIlraith. 14 Jul 2025. [LM&Ro]

Understanding the Impact of Sampling Quality in Direct Preference Optimization
Kyung Rok Kim, Yumo Bai, Chonghuan Wang, Guanting Chen. 03 Jun 2025.

Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO
Kaiyang Guo, Yinchuan Li, Zhitang Chen. 29 May 2025.

Self-Correcting Code Generation Using Small Language Models
Jeonghun Cho, Deokhyung Kang, Hyounghun Kim, Gary Lee. 29 May 2025. [KELM, 3DV, LRM]

Attention-Based Reward Shaping for Sparse and Delayed Rewards
Ian Holmes, Min Chi. 16 May 2025. [OffRL]

Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
Peter Chen, Xiaopeng Li, Zhiyu Li, Xi Chen, Tianyi Lin. 16 May 2025.

A Survey on Progress in LLM Alignment from the Perspective of Reward Design
Miaomiao Ji, Yanqiu Wu, Zhibin Wu, Shoujin Wang, Jian Yang, Mark Dras, Usman Naseem. 05 May 2025.

A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future
Jialun Zhong, Wei Shen, Yanzeng Li, Songyang Gao, Hua Lu, Yicheng Chen, Yang Zhang, Wei Zhou, Jinjie Gu, Lei Zou. 12 Apr 2025. [LRM]

Sharpe Ratio-Guided Active Learning for Preference Optimization in RLHF
Syrine Belakaria, Joshua Kazdan, Charles Marx, Chris Cundy, Willie Neiswanger, Sanmi Koyejo, Barbara Engelhardt, Stefano Ermon. 28 Mar 2025.

VARP: Reinforcement Learning from Vision-Language Model Feedback with Agent Regularized Preferences
Anukriti Singh, Amisha Bhaskar, Peihong Yu, Souradip Chakraborty, Ruthwik Dasyam, Amrit Singh Bedi, Erfaun Noorani. 18 Mar 2025.

Societal Alignment Frameworks Can Improve LLM Alignment
Karolina Stańczak, Nicholas Meade, Mehar Bhatia, Hattie Zhou, Konstantin Böttinger, ..., Timothy P. Lillicrap, Ana Marasović, Sylvie Delacroix, Gillian K. Hadfield, Siva Reddy. 27 Feb 2025.

Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking
Jie Ren, Yuhang Zhang, Dongrui Liu, Xiaopeng Zhang, Qi Tian. 01 Feb 2025.

Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang. 05 Sep 2024. [PILM, AAML]

Exploring and Addressing Reward Confusion in Offline Preference Learning
Xin Chen, Sam Toyer, Florian Shkurti. 22 Jul 2024. [OffRL]

AI Safety in Generative AI Large Language Models: A Survey
Jaymari Chua, Yun Yvonna Li, Shiyi Yang, Chen Wang, Lina Yao. 06 Jul 2024. [LM&MA]

LLM Critics Help Catch LLM Bugs
Nat McAleese, Rai Michael Pokorny, Juan Felipe Cerón Uribe, Evgenia Nitishinskaya, Maja Trebacz, Jan Leike. 28 Jun 2024. [ALM, LRM]

Iterative Sizing Field Prediction for Adaptive Mesh Generation From Expert Demonstrations
Niklas Freymuth, Philipp Dahlinger, Tobias Würth, P. Becker, Aleksandar Taranovic, Onno Grönheim, Luise Kärger, Gerhard Neumann. 20 Jun 2024. [AI4CE]

Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, Tong Zhang. 18 Jun 2024.

FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning
Yuwei Fu, Haichao Zhang, Di Wu, Wei Xu, Benoit Boulet. 02 Jun 2024. [VLM]

Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data
Tim Baumgärtner, Yang Gao, Dana Alon, Donald Metzler. 08 Apr 2024. [AAML]

DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation
Xueqing Wu, Rui Zheng, Jingzhen Sha, Te-Lin Wu, Hanyu Zhou, Mohan Tang, Kai-Wei Chang, Nanyun Peng, Haoran Huang. 04 Mar 2024.

BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback
Gaurav Pandey, Yatin Nandwani, Tahira Naseem, Mayank Mishra, Guangxuan Xu, Lucian Popa, Sachindra Joshi, Asim Munawar, Ramón Fernández Astudillo. 04 Feb 2024. [BDL]

On the Limitations of Markovian Rewards to Express Multi-Objective, Risk-Sensitive, and Modal Tasks
Conference on Uncertainty in Artificial Intelligence (UAI), 2024.
Joar Skalse, Alessandro Abate. 26 Jan 2024.

Diffusion Model Alignment Using Direct Preference Optimization
Computer Vision and Pattern Recognition (CVPR), 2024.
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik. 21 Nov 2023. [EGVM]

Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations
Joey Hong, Sergey Levine, Anca Dragan. 09 Nov 2023. [OffRL, LLMAG]

Active teacher selection for reinforcement learning from human feedback
Rachel Freedman, Justin Svegliato, K. H. Wray, Stuart J. Russell. 23 Oct 2023.

Improving Generalization of Alignment with Human Preferences through Group Invariant Learning
International Conference on Learning Representations (ICLR), 2024.
Rui Zheng, Wei Shen, Yuan Hua, Wenbin Lai, Jiajun Sun, ..., Xiao Wang, Haoran Huang, Tao Gui, Xuanjing Huang. 18 Oct 2023.

Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Jiajun Sun, Tao Gui, Xuanjing Huang. 08 Oct 2023. [ALM]

A Long Way to Go: Investigating Length Correlations in RLHF
Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett. 05 Oct 2023.

Searching for High-Value Molecules Using Reinforcement Learning and Transformers
International Conference on Learning Representations (ICLR), 2024.
Raj Ghugare, Santiago Miret, Adriana Hugessen, Mariano Phielipp, Glen Berseth. 04 Oct 2023.

Motif: Intrinsic Motivation from Artificial Intelligence Feedback
International Conference on Learning Representations (ICLR), 2024.
Martin Klissarov, P. D'Oro, Shagun Sodhani, Roberta Raileanu, Pierre-Luc Bacon, Pascal Vincent, Amy Zhang, Mikael Henaff. 29 Sep 2023. [LRM, LLMAG]

STARC: A General Framework For Quantifying Differences Between Reward Functions
International Conference on Learning Representations (ICLR), 2024.
Joar Skalse, Lucy Farnik, S. Motwani, Erik Jenner, Adam Gleave, Alessandro Abate. 26 Sep 2023.

Regularizing Adversarial Imitation Learning Using Causal Invariance
Ivan Ovinnikov, J. M. Buhmann. 17 Aug 2023. [CML]

Reinforced Self-Training (ReST) for Language Modeling
Çağlar Gülçehre, T. Paine, S. Srinivasan, Ksenia Konyushkova, L. Weerts, ..., Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, Nando de Freitas. 17 Aug 2023. [OffRL]

ZYN: Zero-Shot Reward Models with Yes-No Questions for RLAIF
Víctor Gallego. 11 Aug 2023. [SyDa]