Feedback Loops With Language Models Drive In-Context Reward Hacking

Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt
KELM · 9 February 2024 · arXiv:2402.06627

Papers citing "Feedback Loops With Language Models Drive In-Context Reward Hacking"

12 / 12 papers shown

1. Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
   Xiaobao Wu
   LRM · 05 May 2025

2. Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning
   Shaokun Zhang, Yi Dong, Jieyu Zhang, Jan Kautz, Bryan Catanzaro, Andrew Tao, Qingyun Wu, Zhiding Yu, Guilin Liu
   LLMAG, OffRL, KELM, LRM · 25 Apr 2025

3. Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
   Teng Wang, Zhangyi Jiang, Zhenqi He, Wenhan Yang, Yanan Zheng, Zeyu Li, Zifan He, Shenyang Tong, Hailei Gong
   LRM · 16 Mar 2025

4. Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
   Wenkai Yang, Shiqi Shen, Guangyao Shen, Zhi Gong, Yankai Lin, Ji-Rong Wen
   17 Jun 2024

5. LLM Evaluators Recognize and Favor Their Own Generations
   Arjun Panickssery, Samuel R. Bowman, Shi Feng
   15 Apr 2024

6. Data Feedback Loops: Model-driven Amplification of Dataset Biases
   Rohan Taori, Tatsunori B. Hashimoto
   08 Sep 2022

7. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
   Deep Ganguli, Liane Lovitt, John Kernion, Amanda Askell, Yuntao Bai, ..., Nicholas Joseph, Sam McCandlish, C. Olah, Jared Kaplan, Jack Clark
   23 Aug 2022

8. Breaking Feedback Loops in Recommender Systems with Causal Inference
   K. Krauth, Yixin Wang, Michael I. Jordan
   CML · 04 Jul 2022

9. Preference Dynamics Under Personalized Recommendations
   Sarah Dean, Jamie Morgenstern
   25 May 2022

10. Unsolved Problems in ML Safety
    Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt
    28 Sep 2021

11. Extracting Training Data from Large Language Models
    Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, ..., Tom B. Brown, D. Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel
    MLAU, SILM · 14 Dec 2020

12. How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility
    A. Chaney, Brandon M. Stewart, Barbara E. Engelhardt
    CML · 30 Oct 2017