2502.18770
Cited By
v1
v2
v3 (latest)
Reward Shaping to Mitigate Reward Hacking in RLHF
26 February 2025
Jiayi Fu
Xuandong Zhao
Chengyuan Yao
Han Wang
Qi Han
Yanghua Xiao
Papers citing
"Reward Shaping to Mitigate Reward Hacking in RLHF"
22 / 22 papers shown
Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey
Qiyuan Liu
Hao Xu
Xuhong Chen
Wei Chen
Yee Whye Teh
Ning Miao
ReLM
LRM
AI4CE
42
0
0
02 Oct 2025
Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF
Jing Liu
0
0
0
29 Sep 2025
Aligning Audio Captions with Human Preferences
Kartik Hegde
Rehana Mahfuz
Yinyi Guo
Erik M. Visser
16
0
0
18 Sep 2025
Pluralistic Off-policy Evaluation and Alignment
Chengkai Huang
Junda Wu
Zhouhang Xie
Yu Xia
Rui Wang
Tong Yu
Subrata Mitra
Julian McAuley
L. Yao
OffRL
32
0
0
15 Sep 2025
Virtual Agent Economies
Nenad Tomašev
Matija Franklin
Joel Z. Leibo
Julian Jacobs
William A. Cunningham
Iason Gabriel
Simon Osindero
40
0
0
12 Sep 2025
The Anti-Ouroboros Effect: Emergent Resilience in Large Language Models from Recursive Selective Feedback
Sai Teja Reddy Adapala
20
0
0
02 Sep 2025
Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning
Zhiwei Li
Yong Hu
Wenqing Wang
LLMAG
24
0
0
27 Aug 2025
Self-Rewarding Vision-Language Model via Reasoning Decomposition
Zongxia Li
Wenhao Yu
Chengsong Huang
Rui Liu
Zhenwen Liang
...
Jingxi Che
Dian Yu
Jordan L. Boyd-Graber
Haitao Mi
Dong Yu
ReLM
VLM
LRM
27
12
0
27 Aug 2025
A Rolling Stone Gathers No Moss: Adaptive Policy Optimization for Stable Self-Evaluation in Large Multimodal Models
Wenkai Wang
Hongcan Guo
Zheqi Lv
Shengyu Zhang
16
0
0
05 Aug 2025
CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment
Guofu Xie
Yunsheng Shi
Hongtao Tian
Ting Yao
Xiao Zhang
OffRL
LRM
50
0
0
04 Aug 2025
Mitigating Attention Hacking in Preference-Based Reward Modeling via Interaction Distillation
Jianxiang Zang
Meiling Ning
Shihan Dou
Jiazheng Zhang
Tao Gui
Qi Zhang
Xuanjing Huang
AAML
43
0
0
04 Aug 2025
AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning
Tevin Wang
Chenyan Xiong
LRM
127
2
0
18 Jun 2025
Self-Adapting Language Models
Adam Zweiger
Jyothish Pari
Han Guo
Ekin Akyürek
Yoon Kim
Pulkit Agrawal
KELM
LRM
272
8
0
12 Jun 2025
RewardAnything: Generalizable Principle-Following Reward Models
Zhuohao Yu
Jiali Zeng
Weizheng Gu
Yidong Wang
Jindong Wang
Fandong Meng
Jie Zhou
Yue Zhang
Shikun Zhang
Wei Ye
LRM
219
7
0
04 Jun 2025
Doubly Robust Alignment for Large Language Models
Erhan Xu
Kai Ye
Hongyi Zhou
Luhan Zhu
Francesco Quinzan
Chengchun Shi
121
0
0
01 Jun 2025
Enhancing Tool Learning in Large Language Models with Hierarchical Error Checklists
Yue Cui
Liuyi Yao
Shuchang Tao
Weijie Shi
Yaliang Li
Bolin Ding
Xiaofang Zhou
80
2
0
28 May 2025
Learning Explainable Dense Reward Shapes via Bayesian Optimization
Ryan Koo
Ian Yang
Vipul Raheja
Mingyi Hong
Kwang-Sung Jun
Dongyeop Kang
128
1
0
22 Apr 2025
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
Hardy Chen
Haoqin Tu
Fali Wang
Hui Liu
Xianfeng Tang
Xinya Du
Yuyin Zhou
Cihang Xie
ReLM
VLM
OffRL
LRM
248
91
0
10 Apr 2025
Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations
Pedro Ferreira
Wilker Aziz
Ivan Titov
LRM
175
2
0
07 Apr 2025
Probabilistic Uncertain Reward Model
Wangtao Sun
Xiang Cheng
Xing Yu
Haotian Xu
Zhao Yang
Shizhu He
Jun Zhao
Kang Liu
296
1
0
28 Mar 2025
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Ruilin Luo
Zhuofan Zheng
Yifan Wang
Xinzhe Ni
Zicheng Lin
...
Yiyao Yu
C. Shi
Ruihang Chu
Jin Zeng
Yujiu Yang
LRM
410
34
0
08 Jan 2025
RRM: Robust Reward Model Training Mitigates Reward Hacking
Tianqi Liu
Wei Xiong
Jie Jessie Ren
Lichang Chen
Junru Wu
...
Yuan Liu
Bilal Piot
Abe Ittycheriah
Aviral Kumar
Mohammad Saleh
AAML
141
33
0
20 Sep 2024