Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization
15 February 2024
Yihan Du, Anna Winnicki, Gal Dalal, Shie Mannor, R. Srikant

Papers citing "Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization"

14 papers
On the Role of Preference Variance in Preference Optimization
Jiacheng Guo, Zihao Li, Jiahao Qiu, Yue Wu, Mengdi Wang
14 Oct 2025
On the optimization dynamics of RLVR: Gradient gap and step size thresholds
Joe Suk, Yaqi Duan
09 Oct 2025
Why is Your Language Model a Poor Implicit Reward Model?
Noam Razin, Yong Lin, Jiarui Yao, Sanjeev Arora
10 Jul 2025
Reshaping Reasoning in LLMs: A Theoretical Analysis of RL Training Dynamics through Pattern Selection
Xingwu Chen, Tianle Li, Difan Zou
05 Jun 2025
Provable Reinforcement Learning from Human Feedback with an Unknown Link Function
Qining Zhang, Lei Ying
03 Jun 2025
Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning (ICLR 2025)
Hyungkyu Kang, Min-hwan Oh
07 Mar 2025
Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective
Jiawei Huang, Bingcong Li, Christoph Dann, Niao He
26 Feb 2025
Direct Preference Optimization-Enhanced Multi-Guided Diffusion Model for Traffic Scenario Generation
Seungjun Yu, Kisung Kim, Daejung Kim, Haewook Han, Jinhan Lee
14 Feb 2025
Hybrid Preference Optimization for Alignment: Provably Faster Convergence Rates by Combining Offline Preferences with Online Exploration
Avinandan Bose, Zhihan Xiong, Aadirupa Saha, S. Du, Maryam Fazel
13 Dec 2024
DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback (ICLR 2024)
Efstathia Soufleri, Ujwal Dinesha, Debajoy Mukherjee, Jian Li, Srinivas Shakkottai
07 Oct 2024
Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference (ICLR 2024)
Qining Zhang, Lei Ying
25 Sep 2024
Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis
Qining Zhang, Honghao Wei, Lei Ying
11 Jun 2024
Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF
Tengyang Xie, Dylan J. Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Hassan Awadallah, Alexander Rakhlin
31 May 2024
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose H. Blanchet, Zhaoran Wang
26 May 2024