Spontaneous Reward Hacking in Iterative Self-Refinement

5 July 2024

Samuel R. Bowman

Shi Feng

Papers citing "Spontaneous Reward Hacking in Iterative Self-Refinement"


Title | Authors | Tags | Counts | Date
Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models | Xiaobao Wu | LRM | 60 0 0 | 05 May 2025
Can AI writing be salvaged? Mitigating Idiosyncrasies and Improving Human-AI Alignment in the Writing Process through Edits | Tuhin Chakrabarty, Philippe Laban, C. Wu | | 45 8 0 | 22 Sep 2024
Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization | Wenkai Yang, Shiqi Shen, Guangyao Shen, Zhi Gong, Yankai Lin, Ji-Rong Wen | | 41 13 0 | 17 Jun 2024
"Oops, Did I Just Say That?" Testing and Repairing Unethical Suggestions of Large Language Models with Suggest-Critique-Reflect Process | Anna Glazkova, Zongjie Li, Michael Kadantsev, Maksim Glazkov | KELM | 22 14 0 | 04 May 2023
Can Large Language Models Be an Alternative to Human Evaluations? | Cheng-Han Chiang, Hung-yi Lee | ALM, LM&MA | 206 559 0 | 03 May 2023