2407.04549
Cited By
Spontaneous Reward Hacking in Iterative Self-Refinement
5 July 2024
Jane Pan
He He
Samuel R. Bowman
Shi Feng
Papers citing
"Spontaneous Reward Hacking in Iterative Self-Refinement"
5 / 5 papers shown
Title
Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
Xiaobao Wu
LRM
60
0
0
05 May 2025
Can AI writing be salvaged? Mitigating Idiosyncrasies and Improving Human-AI Alignment in the Writing Process through Edits
Tuhin Chakrabarty
Philippe Laban
C. Wu
45
8
0
22 Sep 2024
Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Wenkai Yang
Shiqi Shen
Guangyao Shen
Zhi Gong
Yankai Lin
Ji-Rong Wen
41
13
0
17 Jun 2024
"Oops, Did I Just Say That?" Testing and Repairing Unethical Suggestions of Large Language Models with Suggest-Critique-Reflect Process
Pingchuan Ma
Zongjie Li
Ao Sun
Shuai Wang
KELM
22
14
0
04 May 2023
Can Large Language Models Be an Alternative to Human Evaluations?
Cheng-Han Chiang
Hung-yi Lee
ALM
LM&MA
206
559
0
03 May 2023