Defining and Characterizing Reward Hacking

27 September 2022
Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, David M. Krueger
arXiv: 2209.13085

Papers citing "Defining and Characterizing Reward Hacking"

Showing 50 of 60 citing papers.
SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment
Yixuan Tang, Yi Yang
02 Dec 2025

CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
X. Hou, Shaoyuan Xu, Manan Biyani, Mayan Li, Jia-Wei Liu, Todd C. Hollon, Bryan Wang
24 Nov 2025

Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning
Pei Yang, Ke Zhang, Ji Wang, Xiao Chen, Yuxin Tang, Eric Yang, Lynn Ai, Bill Shi
20 Nov 2025

Instrumental goals in advanced AI systems: Features to be managed and not failures to be eliminated?
Willem Fourie
29 Oct 2025

Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only
Qingru Zhang, Liang Qiu, Ilgee Hong, Zhenghao Xu, Tianyi Liu, ..., Bing Yin, Chao Zhang, Jianshu Chen, Haoming Jiang, T. Zhao
24 Oct 2025

Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking
Stephane Hatgis-Kessell, Logan Mondal Bhamidipaty, Emma Brunskill
14 Oct 2025

PoU: Proof-of-Use to Counter Tool-Call Hacking in DeepResearch Agents
Shengjie Ma, Chenlong Deng, Jiaxin Mao, J. Huang, Teng Wang, Junjie Wu, Changwang Zhang, Jun Wang
13 Oct 2025

Understanding Sampler Stochasticity in Training Diffusion Models for RLHF
Jiayuan Sheng, Hanyang Zhao, Haoxian Chen, David Yao, Wenpin Tang
12 Oct 2025

TaoSR-AGRL: Adaptive Guided Reinforcement Learning Framework for E-commerce Search Relevance
Jianhui Yang, Yiming Jin, Pengkun Jiao, Chenhe Dong, Zerui Huang, Shaowei Yao, Xiaojiang Zhou, Dan Ou, Haihong Tang
09 Oct 2025

Vul-R2: A Reasoning LLM for Automated Vulnerability Repair
Xin-Cheng Wen, Zirui Lin, Yijun Yang, Cuiyun Gao, Deheng Ye
07 Oct 2025

Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey
Qiyuan Liu, Hao Xu, Xuhong Chen, Wei Chen, Yee Whye Teh, Ning Miao
02 Oct 2025

Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He
01 Oct 2025

Learn to Guide Your Diffusion Model
Alexandre Galashov, Ashwini Pokle, Arnaud Doucet, Arthur Gretton, Mauricio Delbracio, Valentin De Bortoli
01 Oct 2025

Self-Exploring Language Models for Explainable Link Forecasting on Temporal Graphs via Reinforcement Learning
Zifeng Ding, Shenyang Huang, Zeyu Cao, Emma Kondrup, Zachary Yang, ..., Xianglong Hu, Yuan He, Farimah Poursafaei, Michael M. Bronstein, Andreas Vlachos
31 Aug 2025

PentestJudge: Judging Agent Behavior Against Operational Requirements
Shane Caldwell, Max Harley, Michael Kouremetis, Vincent Abruzzo, Will Pearce
04 Aug 2025

Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement
Víctor Gallego
24 Jul 2025

Ground-Compose-Reinforce: Grounding Language in Agentic Behaviours using Limited Data
Andrew C. Li, Toryn Q. Klassen, Andrew Wang, P. A. Alamdari, Sheila A. McIlraith
14 Jul 2025

Understanding the Impact of Sampling Quality in Direct Preference Optimization
Kyung Rok Kim, Yumo Bai, Chonghuan Wang, Guanting Chen
03 Jun 2025

Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO
Kaiyang Guo, Yinchuan Li, Zhitang Chen
29 May 2025

Self-Correcting Code Generation Using Small Language Models
Jeonghun Cho, Deokhyung Kang, Hyounghun Kim, Gary Lee
29 May 2025

Attention-Based Reward Shaping for Sparse and Delayed Rewards
Ian Holmes, Min Chi
16 May 2025

Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
Peter Chen, Xiaopeng Li, Zhiyu Li, Xi Chen, Tianyi Lin
16 May 2025

A Survey on Progress in LLM Alignment from the Perspective of Reward Design
Miaomiao Ji, Yanqiu Wu, Zhibin Wu, Shoujin Wang, Jian Yang, Mark Dras, Usman Naseem
05 May 2025

A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future
Jialun Zhong, Wei Shen, Yanzeng Li, Songyang Gao, Hua Lu, Yicheng Chen, Yang Zhang, Wei Zhou, Jinjie Gu, Lei Zou
12 Apr 2025

Sharpe Ratio-Guided Active Learning for Preference Optimization in RLHF
Syrine Belakaria, Joshua Kazdan, Charles Marx, Chris Cundy, Willie Neiswanger, Sanmi Koyejo, Barbara Engelhardt, Stefano Ermon
28 Mar 2025

VARP: Reinforcement Learning from Vision-Language Model Feedback with Agent Regularized Preferences
Anukriti Singh, Amisha Bhaskar, Peihong Yu, Souradip Chakraborty, Ruthwik Dasyam, Amrit Singh Bedi, Erfaun Noorani
18 Mar 2025

Societal Alignment Frameworks Can Improve LLM Alignment
Karolina Stańczak, Nicholas Meade, Mehar Bhatia, Hattie Zhou, Konstantin Böttinger, ..., Timothy P. Lillicrap, Ana Marasović, Sylvie Delacroix, Gillian K. Hadfield, Siva Reddy
27 Feb 2025

Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking
Jie Ren, Yuhang Zhang, Dongrui Liu, Xiaopeng Zhang, Qi Tian
01 Feb 2025

Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang
05 Sep 2024

Exploring and Addressing Reward Confusion in Offline Preference Learning
Xin Chen, Sam Toyer, Florian Shkurti
22 Jul 2024

AI Safety in Generative AI Large Language Models: A Survey
Jaymari Chua, Yun Yvonna Li, Shiyi Yang, Chen Wang, Lina Yao
06 Jul 2024

LLM Critics Help Catch LLM Bugs
Nat McAleese, Rai Michael Pokorny, Juan Felipe Cerón Uribe, Evgenia Nitishinskaya, Maja Trebacz, Jan Leike
28 Jun 2024

Iterative Sizing Field Prediction for Adaptive Mesh Generation From Expert Demonstrations
Niklas Freymuth, Philipp Dahlinger, Tobias Würth, P. Becker, Aleksandar Taranovic, Onno Grönheim, Luise Kärger, Gerhard Neumann
20 Jun 2024

Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, Tong Zhang
18 Jun 2024

FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning
Yuwei Fu, Haichao Zhang, Di Wu, Wei Xu, Benoit Boulet
02 Jun 2024

Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data
Tim Baumgärtner, Yang Gao, Dana Alon, Donald Metzler
08 Apr 2024

DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation
Xueqing Wu, Rui Zheng, Jingzhen Sha, Te-Lin Wu, Hanyu Zhou, Mohan Tang, Kai-Wei Chang, Nanyun Peng, Haoran Huang
04 Mar 2024

BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback
Gaurav Pandey, Yatin Nandwani, Tahira Naseem, Mayank Mishra, Guangxuan Xu, Lucian Popa, Sachindra Joshi, Asim Munawar, Ramón Fernández Astudillo
04 Feb 2024

On the Limitations of Markovian Rewards to Express Multi-Objective, Risk-Sensitive, and Modal Tasks
Conference on Uncertainty in Artificial Intelligence (UAI), 2024
Joar Skalse, Alessandro Abate
26 Jan 2024

Diffusion Model Alignment Using Direct Preference Optimization
Computer Vision and Pattern Recognition (CVPR), 2023
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik
21 Nov 2023

Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations
Joey Hong, Sergey Levine, Anca Dragan
09 Nov 2023

Active teacher selection for reinforcement learning from human feedback
Rachel Freedman, Justin Svegliato, K. H. Wray, Stuart J. Russell
23 Oct 2023

Improving Generalization of Alignment with Human Preferences through Group Invariant Learning
International Conference on Learning Representations (ICLR), 2023
Rui Zheng, Wei Shen, Yuan Hua, Wenbin Lai, Jiajun Sun, ..., Xiao Wang, Haoran Huang, Tao Gui, Xuanjing Huang
18 Oct 2023

Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Jiajun Sun, Tao Gui, Xuanjing Huang
08 Oct 2023

A Long Way to Go: Investigating Length Correlations in RLHF
Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett
05 Oct 2023

Searching for High-Value Molecules Using Reinforcement Learning and Transformers
International Conference on Learning Representations (ICLR), 2023
Raj Ghugare, Santiago Miret, Adriana Hugessen, Mariano Phielipp, Glen Berseth
04 Oct 2023

Motif: Intrinsic Motivation from Artificial Intelligence Feedback
International Conference on Learning Representations (ICLR), 2023
Martin Klissarov, P. D'Oro, Shagun Sodhani, Roberta Raileanu, Pierre-Luc Bacon, Pascal Vincent, Amy Zhang, Mikael Henaff
29 Sep 2023

STARC: A General Framework For Quantifying Differences Between Reward Functions
International Conference on Learning Representations (ICLR), 2023
Joar Skalse, Lucy Farnik, S. Motwani, Erik Jenner, Adam Gleave, Alessandro Abate
26 Sep 2023

Regularizing Adversarial Imitation Learning Using Causal Invariance
Ivan Ovinnikov, J. M. Buhmann
17 Aug 2023

Reinforced Self-Training (ReST) for Language Modeling
Çağlar Gülçehre, T. Paine, S. Srinivasan, Ksenia Konyushkova, L. Weerts, ..., Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, Nando de Freitas
17 Aug 2023