RRM: Robust Reward Model Training Mitigates Reward Hacking
arXiv:2409.13156 · 20 September 2024
Tianqi Liu, Wei Xiong, Jie Jessie Ren, Lichang Chen, Junru Wu, Rishabh Joshi, Yang Gao, Jiaming Shen, Zhen Qin, Tianhe Yu, Daniel Sohn, Anastasiia Makarova, Jeremiah Liu, Yuan Liu, Bilal Piot, Abe Ittycheriah, Aviral Kumar, Mohammad Saleh
AAML

Papers citing "RRM: Robust Reward Model Training Mitigates Reward Hacking" (8 papers)

Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
Xiaobao Wu · LRM · 05 May 2025

Energy-Based Reward Models for Robust Language Model Alignment
Anamika Lochab, Ruqi Zhang · 17 Apr 2025

A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future
Jialun Zhong, Wei Shen, Yanzeng Li, Songyang Gao, Hua Lu, Yicheng Chen, Yang Zhang, Wei Zhou, Jinjie Gu, Lei Zou · LRM · 12 Apr 2025

Information-Theoretic Reward Decomposition for Generalizable RLHF
Liyuan Mao, Haoran Xu, Amy Zhang, Weinan Zhang, Chenjia Bai · 08 Apr 2025

Adversarial Training of Reward Models
Alexander Bukharin, Haifeng Qian, Shengyang Sun, Adithya Renduchintala, Soumye Singhal, Z. Wang, Oleksii Kuchaiev, Olivier Delalleau, T. Zhao · AAML · 08 Apr 2025

Reward Shaping to Mitigate Reward Hacking in RLHF
Jiayi Fu, Xuandong Zhao, Chengyuan Yao, H. Wang, Qi Han, Yanghua Xiao · 26 Feb 2025

Self-Generated Critiques Boost Reward Modeling for Language Models
Yue Yu, Zhengxing Chen, Aston Zhang, L Tan, Chenguang Zhu, ..., Suchin Gururangan, Chao-Yue Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou · LRM · ALM · 25 Nov 2024

RainbowPO: A Unified Framework for Combining Improvements in Preference Optimization
Hanyang Zhao, Genta Indra Winata, Anirban Das, Shi-Xiong Zhang, D. Yao, Wenpin Tang, Sambit Sahu · 05 Oct 2024