Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso
19 July 2024 · arXiv:2407.14503
Papers citing "Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification"
Inference-Time Reward Hacking in Large Language Models
Hadi Khalaf, C. M. Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio du Pin Calmon
24 Jun 2025