Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2402.06627
Cited By
Feedback Loops With Language Models Drive In-Context Reward Hacking
9 February 2024
Alexander Pan
Erik Jones
Meena Jagadeesan
Jacob Steinhardt
KELM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Feedback Loops With Language Models Drive In-Context Reward Hacking"
12 / 12 papers shown
Title
Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
Xiaobao Wu
LRM
60
0
0
05 May 2025
Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning
Shaokun Zhang
Yi Dong
Jieyu Zhang
Jan Kautz
Bryan Catanzaro
Andrew Tao
Qingyun Wu
Zhiding Yu
Guilin Liu
LLMAG
OffRL
KELM
LRM
83
0
0
25 Apr 2025
Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Teng Wang
Zhangyi Jiang
Zhenqi He
Wenhan Yang
Yanan Zheng
Zeyu Li
Zifan He
Shenyang Tong
Hailei Gong
LRM
82
1
0
16 Mar 2025
Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Wenkai Yang
Shiqi Shen
Guangyao Shen
Zhi Gong
Yankai Lin
Zhi Gong
Yankai Lin
Ji-Rong Wen
28
13
0
17 Jun 2024
LLM Evaluators Recognize and Favor Their Own Generations
Arjun Panickssery
Samuel R. Bowman
Shi Feng
20
152
0
15 Apr 2024
Data Feedback Loops: Model-driven Amplification of Dataset Biases
Rohan Taori
Tatsunori B. Hashimoto
61
42
0
08 Sep 2022
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli
Liane Lovitt
John Kernion
Amanda Askell
Yuntao Bai
...
Nicholas Joseph
Sam McCandlish
C. Olah
Jared Kaplan
Jack Clark
213
327
0
23 Aug 2022
Breaking Feedback Loops in Recommender Systems with Causal Inference
K. Krauth
Yixin Wang
Michael I. Jordan
CML
33
19
0
04 Jul 2022
Preference Dynamics Under Personalized Recommendations
Sarah Dean
Jamie Morgenstern
45
24
0
25 May 2022
Unsolved Problems in ML Safety
Dan Hendrycks
Nicholas Carlini
John Schulman
Jacob Steinhardt
156
268
0
28 Sep 2021
Extracting Training Data from Large Language Models
Nicholas Carlini
Florian Tramèr
Eric Wallace
Matthew Jagielski
Ariel Herbert-Voss
...
Tom B. Brown
D. Song
Ulfar Erlingsson
Alina Oprea
Colin Raffel
MLAU
SILM
261
1,386
0
14 Dec 2020
How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility
A. Chaney
Brandon M Stewart
Barbara E. Engelhardt
CML
161
281
0
30 Oct 2017
1