
Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data

22 April 2024
Fahim Tajwar, Anika Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, Aviral Kumar
Links: arXiv (abs) · PDF · HTML · HuggingFace (1 upvote) · GitHub (907★)

Papers citing "Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data"

Showing 50 of 87 citing papers (page 1 of 2)

Optimizing LVLMs with On-Policy Data for Effective Hallucination Mitigation
Chengzhi Yu, Yifan Xu, Yifan Chen, Wenyi Zhang
Tags: MLLM · OffRL
30 Nov 2025
A Unified Understanding of Offline Data Selection and Online Self-refining Generation for Post-training LLMs
Quan-Wu Xiao, Tianyi Chen
Tags: OffRL
26 Nov 2025
STORE: Semantic Tokenization, Orthogonal Rotation and Efficient Attention for Scaling Up Ranking Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
Y. Xu, Chaofan Fan, J. Hu, Yu Zhang, Zeng Xiaoyi, J. Zhang
24 Nov 2025
Beyond Reward Margin: Rethinking and Resolving Likelihood Displacement in Diffusion Models via Video Generation
Ruojun Xu, Yu Kai, Xuhua Ren, Jiaxiang Cheng, Bing Ma, Tianxiang Zheng, Qinhlin Lu
Tags: EGVM
24 Nov 2025
Specification, Application, and Operationalization of a Metamodel of Fairness
Julian Alfredo Mendez, T. Kampik
14 Nov 2025
AMaPO: Adaptive Margin-attached Preference Optimization for Language Model Alignment
Ruibo Deng, Duanyu Feng, Wenqiang Lei
12 Nov 2025
Why DPO is a Misspecified Estimator and How to Fix It
Aditya Gopalan, Sayak Ray Chowdhury, Debangshu Banerjee
23 Oct 2025
Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
Howard Chen, Noam Razin, Karthik Narasimhan, Danqi Chen
Tags: CLL · KELM
21 Oct 2025
CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension
Rui Li, Zeyu Zhang, Xiaohe Bo, Zihang Tian, Xu Chen, Quanyu Dai, Zhenhua Dong, Ruiming Tang
Tags: RALM
07 Oct 2025
How Well Can Preference Optimization Generalize Under Noisy Feedback?
Shawn Im, Yixuan Li
01 Oct 2025
One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient
Rui Ming, Haoyuan Wu, Shoubo Hu, Zhuolun He, Bei Yu
Tags: OffRL · LRM
30 Sep 2025
A Measurement Study of Model Context Protocol Ecosystem
Hechuan Guo, Yongle Hao, Yue Zhang, Minghui Xu, Peizhuo Lyu, Jiezhi Chen, Xiuzhen Cheng
29 Sep 2025
General Exploratory Bonus for Optimistic Exploration in RLHF
W. Li, Changdae Oh, Yixuan Li
Tags: AI4CE
27 Sep 2025
AI Kill Switch for malicious web-based LLM agent
Sechan Lee, Sangdon Park
Tags: LLMAG · AAML
26 Sep 2025
SSFO: Self-Supervised Faithfulness Optimization for Retrieval-Augmented Generation
Xiaqiang Tang, Yi Wang, Keyu Hu, Rui Xu, Chuang Li, Weigao Sun, Jian Li, Sihong Xie
Tags: RALM
24 Aug 2025
What Matters in Data for DPO?
Yu Pan, Zhongze Cai, Guanting Chen, Huaiyang Zhong, Chonghuan Wang
23 Aug 2025
Is On-Policy Data always the Best Choice for Direct Preference Optimization-based LM Alignment?
Zetian Sun, Dongfang Li, Baotian Hu, Min Zhang
14 Aug 2025
Sample-efficient LLM Optimization with Reset Replay
Zichuan Liu, Jinyu Wang, Lei Song, Jiang Bian
Tags: OffRL
08 Aug 2025
MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
Guangchen Lan, Sipeng Zhang, Tianle Wang, Yuwei Zhang, Daoan Zhang, Xinpeng Wei, Xiaoman Pan, Hongming Zhang, Dong-Jun Han, Christopher G. Brinton
27 Jul 2025
The Hidden Link Between RLHF and Contrastive Learning
Xufei Lv, Kehai Chen, Haoyuan Sun, X. Bai, Min Zhang, Houde Liu
27 Jun 2025
Cognitive models can reveal interpretable value trade-offs in language models
Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman
25 Jun 2025
Rethinking DPO: The Role of Rejected Responses in Preference Misalignment
Jay Hyeon Cho, JunHyeok Oh, Myunsoo Kim, Byung-Jun Lee
15 Jun 2025
e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs
Amrith Rajagopal Setlur, Matthew Y. R. Yang, Charlie Snell, Jeremy Greer, Ian Wu, Virginia Smith, Max Simchowitz, Aviral Kumar
Tags: LRM
10 Jun 2025
Reinforce LLM Reasoning through Multi-Agent Reflection
Yurun Yuan, Tengyang Xie
Tags: LRM
10 Jun 2025
Explicit Preference Optimization: No Need for an Implicit Reward Model
Xiangkun Hu, Lemin Kong, Tong He, David Wipf
09 Jun 2025
Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Rajagopal Setlur, ..., Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar
Tags: LLMAG · LRM
09 Jun 2025
Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models
Mickel Liu, L. Jiang, Yancheng Liang, S. Du, Yejin Choi, Tim Althoff, Natasha Jaques
Tags: AAML · LRM
09 Jun 2025
Dissecting Long-Chain-of-Thought Reasoning Models: An Empirical Study
Yongyu Mu, Jiali Zeng, Bei Li, Xinyan Guan, Fandong Meng, Jie Zhou, Tong Xiao, Jingbo Zhu
Tags: OffRL · LRM
05 Jun 2025
Understanding the Impact of Sampling Quality in Direct Preference Optimization
Kyung Rok Kim, Yumo Bai, Chonghuan Wang, Guanting Chen
03 Jun 2025
Can Large Reasoning Models Self-Train?
Sheikh Shafayat, Fahim Tajwar, Ruslan Salakhutdinov, J. Schneider, Andrea Zanette
Tags: ReLM · OffRL · LRM
27 May 2025
Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
Ruizhe Shi, Minhak Song, Runlong Zhou, Zihan Zhang, Maryam Fazel, S. S. Du
26 May 2025
Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
Peter Chen, Xiaopeng Li, Zhiyu Li, Xi Chen, Tianyi Lin
16 May 2025
InfoPO: On Mutual Information Maximization for Large Language Model Alignment
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Teng Xiao, Zhen Ge, Sujay Sanghavi, Tian Wang, Julian Katz-Samuels, Marc Versage, Qingjun Cui, Trishul Chilimbi
13 May 2025
ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via α-β-Divergence
Guanghui Wang, Zhiyong Yang, Liang Luo, Shi Wang, Qianqian Xu, Qingming Huang
07 May 2025
SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
Tianjian Li, Daniel Khashabi
05 May 2025
Semantic Probabilistic Control of Language Models
Kareem Ahmed, Catarina G Belém, Padhraic Smyth, Sameer Singh
04 May 2025
Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach
Jiancong Xiao, Bojian Hou, Zhanliang Wang, Ruochen Jin, Q. Long, Weijie Su, Li Shen
04 May 2025
DRAGON: Distributional Rewards Optimize Diffusion Generative Models
Yatong Bai, Jonah Casebeer, Somayeh Sojoudi, Nicholas J. Bryan
Tags: DiffM · VLM
21 Apr 2025
Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo
International Conference on Learning Representations (ICLR), 2025
João Loula, Benjamin LeBrun, Li Du, Ben Lipkin, Clemente Pasti, ..., Ryan Cotterell, Vikash K. Mansinghka, Alexander K. Lew, Tim Vieira, Timothy J. O'Donnell
17 Apr 2025
Efficient Construction of Model Family through Progressive Training Using Model Expansion
Kazuki Yano, Sho Takase, Sosuke Kobayashi, Shun Kiyono, Jun Suzuki
01 Apr 2025
RL-finetuning LLMs from on- and off-policy data with a single algorithm
Yunhao Tang, Taco Cohen, David W. Zhang, Michal Valko, Rémi Munos
Tags: OffRL
25 Mar 2025
Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs
Nicolas Le Roux, Marc G. Bellemare, Jonathan Lebensold, Arnaud Bergeron, Joshua Greaves, Alex Fréchette, Carolyne Pelletier, Eric Thibodeau-Laufer, Sándor Toth, Sam Work
Tags: OffRL
18 Mar 2025
DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs
Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, Se-Young Yun
Tags: VLM
10 Mar 2025
Preserving Cultural Identity with Context-Aware Translation Through Multi-Agent AI Systems
Mahfuz Ahmed Anik, Abdur Rahman, Azmine Toushik Wasi, Md Manjurul Ahsan
05 Mar 2025
All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
Gokul Swamy, Sanjiban Choudhury, Wen Sun, Zhiwei Steven Wu, J. Andrew Bagnell
Tags: OffRL
03 Mar 2025
Training a Generally Curious Agent
Fahim Tajwar, Yiding Jiang, Abitha Thankaraj, Sumaita Sadia Rahman, J. Zico Kolter, Jeff Schneider, Ruslan Salakhutdinov
24 Feb 2025
SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters
International Conference on Learning Representations (ICLR), 2025
Teng Xiao, Yige Yuan, Ziyang Chen, Mingxiao Li, Shangsong Liang, Zhaochun Ren, V. Honavar
21 Feb 2025
S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li
Tags: LRM
18 Feb 2025
Preference learning made easy: Everything should be understood through win rate
Lily H. Zhang, Rajesh Ranganath
14 Feb 2025
Digi-Q: Learning Q-Value Functions for Training Device-Control Agents
Hao Bai, Yifei Zhou, Li Erran Li, Sergey Levine, Aviral Kumar
Tags: OffRL
13 Feb 2025