Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning

30 April 2024
Mathieu Rita
Florian Strub
Rahma Chaabouni
Paul Michel
Emmanuel Dupoux
Olivier Pietquin

Papers citing "Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning"

9 papers shown
EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance
Siyao Song, Cong Ma, Zhihao Cheng, Shiye Lei, Minghao Li, Ying Zeng, Huaixiao Tou, Kai Jia
28 Sep 2025
LLM-Driven Self-Refinement for Embodied Drone Task Planning
Deyu Zhang, Xicheng Zhang, Jiahao Li, Tingting Long, Xunhua Dai, Yongjian Fu, Jinrui Zhang, Ju Ren, Yaoxue Zhang
21 Aug 2025
MEMETRON: Metaheuristic Mechanisms for Test-time Response Optimization of Large Language Models
S. Nguyen, Theja Tulabandhula
10 Jun 2025
Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO
Kaiyang Guo, Yinchuan Li, Zhitang Chen
29 May 2025
LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits
Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Joey Tianyi Zhou
02 Oct 2024
Post-hoc Reward Calibration: A Case Study on Length Bias
International Conference on Learning Representations (ICLR), 2024
Zeyu Huang, Zihan Qiu, Zili Wang, Edoardo M. Ponti, Ivan Titov
25 Sep 2024
Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing
Huanqian Wang, Yang Yue, Rui Lu, Jingxin Shi, Andrew Zhao, Shenzhi Wang, Shiji Song, Gao Huang
11 Jul 2024
Robust Preference Optimization through Reward Model Distillation
Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Peter Shaw, Jonathan Berant
29 May 2024
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose H. Blanchet, Zhaoran Wang
26 May 2024