A General Theoretical Paradigm to Understand Learning from Human Preferences

arXiv: 2310.12036
International Conference on Artificial Intelligence and Statistics (AISTATS), 2023
18 October 2023
M. G. Azar
Mark Rowland
Bilal Piot
Daniel Guo
Daniele Calandriello
Michal Valko
Rémi Munos

Papers citing "A General Theoretical Paradigm to Understand Learning from Human Preferences"

50 / 579 papers shown
Theoretical Tensions in RLHF: Reconciling Empirical Success with Inconsistencies in Social Choice Theory
Jiancong Xiao
Zhekun Shi
Kaizhao Liu
Q. Long
Weijie J. Su
227
4
0
14 Jun 2025
On a few pitfalls in KL divergence gradient estimation for RL
Yunhao Tang
Rémi Munos
246
9
0
11 Jun 2025
Reinforce LLM Reasoning through Multi-Agent Reflection
Yurun Yuan
Tengyang Xie
LRM
317
16
0
10 Jun 2025
Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling
Phuc Minh Nguyen
Ngoc-Hieu Nguyen
Duy Nguyen
Anji Liu
An Mai
Binh T. Nguyen
Daniel Sonntag
Khoa D. Doan
287
0
0
10 Jun 2025
ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization
Hee Suk Yoon
Eunseop Yoon
Mark Hasegawa-Johnson
Sungwoong Kim
Chang D. Yoo
250
0
0
10 Jun 2025
Preference-Driven Multi-Objective Combinatorial Optimization with Conditional Computation
Mingfeng Fan
Jianan Zhou
Yifeng Zhang
Yaoxin Wu
Jinbiao Chen
Guillaume Sartoretti
AI4CE
280
1
0
10 Jun 2025
Explicit Preference Optimization: No Need for an Implicit Reward Model
Xiangkun Hu
Lemin Kong
Tong He
David Wipf
186
0
0
09 Jun 2025
Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization
Zixuan Huang
Yikun Ban
Lean Fu
Xiaojie Li
Zhongxiang Dai
Jianxin Li
Deqing Wang
347
2
0
08 Jun 2025
Reshaping Reasoning in LLMs: A Theoretical Analysis of RL Training Dynamics through Pattern Selection
Xingwu Chen
Tianle Li
Difan Zou
LRM
382
1
0
05 Jun 2025
Robust Preference Optimization via Dynamic Target Margins
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jie Sun
Junkang Wu
Jiancan Wu
Zhibo Zhu
Xingyu Lu
Jun Zhou
Lintao Ma
Xiang Wang
291
4
0
04 Jun 2025
Aligning Large Language Models with Implicit Preferences from User-Generated Content
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zhaoxuan Tan
Zheng Li
Tianyi Liu
Haodong Wang
Hyokun Yun
...
Yifan Gao
Ruijie Wang
Priyanka Nigam
Bing Yin
Meng Jiang
232
6
0
04 Jun 2025
Multi-objective Aligned Bidword Generation Model for E-commerce Search Advertising
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2025
Zhenhui Liu
Chunyuan Yuan
Ming Pang
Zheng Fang
Li Yuan
Xue Jiang
Changping Peng
Zhangang Lin
Zheng Luo
Jingping Shao
200
1
0
04 Jun 2025
Provable Reinforcement Learning from Human Feedback with an Unknown Link Function
Qining Zhang
Lei Ying
252
0
0
03 Jun 2025
daDPO: Distribution-Aware DPO for Distilling Conversational Abilities
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zhengze Zhang
Shiqi Wang
Yiqun Shen
Simin Guo
Dahua Lin
Xiaoliang Wang
Nguyen Cam-Tu
Fei Tan
215
1
0
03 Jun 2025
Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences
Yunhong Lu
Qichao Wang
H. Cao
Xiaoyin Xu
Min Zhang
332
5
0
03 Jun 2025
Understanding the Impact of Sampling Quality in Direct Preference Optimization
Kyung Rok Kim
Yumo Bai
Chonghuan Wang
Guanting Chen
277
0
0
03 Jun 2025
What Makes LLMs Effective Sequential Recommenders? A Study on Preference Intensity and Temporal Context
Z. Ouyang
Qianlong Wen
Chunhui Zhang
Yanfang Ye
Soroush Vosoughi
HAI
231
1
0
02 Jun 2025
HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Songtao Jiang
Yan Zhang
Yeying Jin
Hongwei Wang
Y. Wu
Yang Feng
Jian Wu
Zuozhu Liu
214
3
0
01 Jun 2025
Doubly Robust Alignment for Large Language Models
Erhan Xu
Kai Ye
Hongyi Zhou
Luhan Zhu
Francesco Quinzan
Chengchun Shi
306
4
0
01 Jun 2025
Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning
Shuyao Xu
Cheng Peng
Jiangxuan Long
Weidi Xu
Wei Chu
Yuan Qi
LRM
198
2
0
30 May 2025
On Symmetric Losses for Robust Policy Optimization with Noisy Preferences
Soichiro Nishimori
Yu Zhang
Thanawat Lodkaew
Masashi Sugiyama
NoLa
262
2
0
30 May 2025
Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap
Wenhan Yang
Spencer Stice
Ali Payani
Baharan Mirzasoleiman
MLLM
222
1
0
30 May 2025
Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data
Seohyeong Lee
Eunwon Kim
Hwaran Lee
Buru Chang
315
1
0
29 May 2025
Probability-Consistent Preference Optimization for Enhanced LLM Reasoning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yunqiao Yang
Houxing Ren
Zimu Lu
Ke Wang
Weikang Shi
A-Long Zhou
Junting Pan
Mingjie Zhan
Hongsheng Li
LRM
227
0
0
29 May 2025
Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?
Paul Gölz
Nika Haghtalab
Kunhe Yang
201
7
0
29 May 2025
Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO
Kaiyang Guo
Yinchuan Li
Zhitang Chen
352
2
0
29 May 2025
Sherlock: Self-Correcting Reasoning in Vision-Language Models
Yi Ding
Ruqi Zhang
ReLM LRM VLM
256
6
0
28 May 2025
Enhancing Paraphrase Type Generation: The Impact of DPO and RLHF Evaluated with Human-Ranked Data
Christopher Lee Lübbers
175
0
0
28 May 2025
SDPO: Importance-Sampled Direct Preference Optimization for Stable Diffusion Training
Xiaomeng Yang
Zhiyu Tan
Junyan Wang
Zhijian Zhou
Hao Li
289
0
0
28 May 2025
Accelerating RL for LLM Reasoning with Optimal Advantage Regression
Kianté Brantley
Mingyu Chen
Zhaolin Gao
Jason D. Lee
Wen Sun
Wenhao Zhan
Xuezhou Zhang
OffRL LRM
261
12
0
27 May 2025
Fundamental Limits of Game-Theoretic LLM Alignment: Smith Consistency and Preference Matching
Zhekun Shi
Kaizhao Liu
Qi Long
Weijie J. Su
Jiancong Xiao
172
7
0
27 May 2025
SquareχPO: Differentially Private and Robust χ²-Preference Optimization in Offline Direct Alignment
Xingyu Zhou
Yulian Wu
Wenqian Weng
Francesco Orabona
321
0
0
27 May 2025
Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
Ruizhe Shi
Minhak Song
Runlong Zhou
Zihan Zhang
Maryam Fazel
S. S. Du
309
6
0
26 May 2025
Token-Importance Guided Direct Preference Optimization
Yang Ning
Lin Hai
Liu Yibo
Tian Baoliang
Liu Guoqing
Zhang Haijun
268
0
0
26 May 2025
Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models
Yi Liu
Dianqing Liu
Mingye Zhu
Junbo Guo
Yongdong Zhang
Zhendong Mao
373
0
0
26 May 2025
Frictional Agent Alignment Framework: Slow Down and Don't Break Things
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Abhijnan Nath
Carine Graff
Andrei Bachinin
Nikhil Krishnaswamy
308
4
0
26 May 2025
Accelerating Nash Learning from Human Feedback via Mirror Prox
D. Tiapkin
Daniele Calandriello
Denis Belomestny
Eric Moulines
Alexey Naumov
Kashif Rasul
Michal Valko
Pierre Ménard
243
3
0
26 May 2025
SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety
Geon-hyeong Kim
Youngsoo Jang
Yu Jin Kim
Byoungjip Kim
Honglak Lee
Kyunghoon Bae
Moontae Lee
257
17
0
26 May 2025
Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models
International Joint Conference on Artificial Intelligence (IJCAI), 2025
Y. Zhang
Yu Yu
Bo Tang
Yu Zhu
Chuxiong Sun
...
Jie Hu
Zipeng Xie
Zhiyu Li
Feiyu Xiong
Edward Chung
483
0
0
26 May 2025
Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Meng Li
Guangda Huzhang
Haibo Zhang
Xiting Wang
Anxiang Zeng
244
1
0
24 May 2025
Rethinking Direct Preference Optimization in Diffusion Models
Junyong Kang
Seohyun Lim
Kyungjune Baek
Hyunjung Shim
1.0K
0
0
24 May 2025
Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
Huayu Chen
Kaiwen Zheng
Qinsheng Zhang
Ganqu Cui
Yin Cui
Haotian Ye
Tsung-Yi Lin
Ming-Yu Liu
Jun Zhu
Haoxiang Wang
OffRL LRM
526
16
0
23 May 2025
Towards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning
Yutong Chen
Jiandong Gao
Ji Wu
ALM
452
2
0
23 May 2025
MPO: Multilingual Safety Alignment via Reward Gap Optimization
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Weixiang Zhao
Yulin Hu
Yang Deng
Tongtong Wu
Wenxuan Zhang
...
An Zhang
Yanyan Zhao
Bing Qin
Tat-Seng Chua
Ting Liu
316
7
0
22 May 2025
LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions
Chaochen Gao
Xing Wu
Zijia Lin
Debing Zhang
Songlin Hu
SyDa
604
1
0
22 May 2025
Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning
Yurun Yuan
Fan Chen
Zeyu Jia
Alexander Rakhlin
Tengyang Xie
OffRL
362
1
0
21 May 2025
InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models
Yanggan Gu
Zhaoyi Yan
Yuanyi Wang
Yiming Zhang
Qi Zhou
Leilei Gan
Hongxia Yang
300
2
0
20 May 2025
On-Policy Optimization with Group Equivalent Preference for Multi-Programming Language Understanding
Haoyuan Wu
Rui Ming
Jilong Gao
Hangyu Zhao
Xueyi Chen
Yikai Yang
Haisheng Zheng
Zhuolun He
Bei Yu
303
2
0
19 May 2025
DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization
Gang Li
Ming Lin
Tomer Galanti
Zhengzhong Tu
Tianbao Yang
488
8
0
18 May 2025
SGDPO: Self-Guided Direct Preference Optimization for Language Model Alignment
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Wenqiao Zhu
Ji Liu
Lulu Wang
Jun Wu
Yulun Zhang
366
2
0
18 May 2025