ResearchTrend.AI


A General Theoretical Paradigm to Understand Learning from Human Preferences

International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
18 October 2023
M. G. Azar
Mark Rowland
Bilal Piot
Daniel Guo
Daniele Calandriello
Michal Valko
Rémi Munos
arXiv:2310.12036 (abs · PDF · HTML) · HuggingFace (16 upvotes)

Papers citing "A General Theoretical Paradigm to Understand Learning from Human Preferences"

50 / 579 papers shown
Mutual-Taught for Co-adapting Policy and Reward Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Tianyuan Shi
Canbin Huang
Fanqi Wan
Longguang Zhong
Ziyi Yang
Weizhou Shen
Xiaojun Quan
Ming Yan
268
1
0
17 May 2025
Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
Peter Chen
Xiaopeng Li
Zhiyu Li
Xi Chen
Tianyi Lin
507
0
0
16 May 2025
ShiQ: Bringing back Bellman to LLMs
Pierre Clavier
Nathan Grinsztajn
Raphaël Avalos
Yannis Flet-Berliac
Irem Ergun
...
Eugene Tarassov
Olivier Pietquin
Pierre Harvey Richemond
Florian Strub
Matthieu Geist
OffRL
236
1
0
16 May 2025
SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization
Huashan Sun
Shengyi Liao
Yansen Han
Yu Bai
Yang Gao
...
Weizhou Shen
Fanqi Wan
Ming Yan
J.N. Zhang
Fei Huang
585
1
0
16 May 2025
InfoPO: On Mutual Information Maximization for Large Language Model Alignment
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Teng Xiao
Zhen Ge
Sujay Sanghavi
Tian Wang
Julian Katz-Samuels
Marc Versage
Qingjun Cui
Trishul Chilimbi
391
3
0
13 May 2025
Preference Optimization for Combinatorial Optimization Problems
Mingjun Pan
Guanquan Lin
You-Wei Luo
Bin Zhu
Zhien Dai
Lijun Sun
Chun Yuan
266
4
0
13 May 2025
Direct Density Ratio Optimization: A Statistically Consistent Approach to Aligning Large Language Models
Rei Higuchi
Taiji Suzuki
348
2
0
12 May 2025
Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes
Zhuocheng Gong
Jian Guan
Wei Wu
Huishuai Zhang
Dongyan Zhao
341
4
0
08 May 2025
Improving Model Alignment Through Collective Intelligence of Open-Source LLMs
Junlin Wang
Roy Xie
Shang Zhu
Jue Wang
Ben Athiwaratkun
Bhuwan Dhingra
Shuaiwen Leon Song
Ce Zhang
James Zou
ALM
235
3
0
05 May 2025
SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
Tianjian Li
Daniel Khashabi
318
2
0
05 May 2025
Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL
Jiarui Yao
Yifan Hao
Hanning Zhang
Hanze Dong
Wei Xiong
Nan Jiang
Tong Zhang
LRM
385
10
0
05 May 2025
A Survey on Progress in LLM Alignment from the Perspective of Reward Design
Miaomiao Ji
Yanqiu Wu
Zhibin Wu
Shoujin Wang
Jian Yang
Mark Dras
Usman Naseem
357
9
0
05 May 2025
R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation
Meng-Hao Guo
Jiajun Xu
Yi Zhang
Jiaxi Song
Haoyang Peng
...
Yongming Rao
Houwen Peng
Han Hu
Gordon Wetzstein
Shi-Min Hu
ELM, LRM
286
20
0
04 May 2025
Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach
Jiancong Xiao
Bojian Hou
Zhanliang Wang
Ruochen Jin
Q. Long
Weijie Su
Li Shen
473
14
0
04 May 2025
HyPerAlign: Interpretable Personalized LLM Alignment via Hypothesis Generation
Cristina Garbacea
Chenhao Tan
441
0
0
29 Apr 2025
Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors
Ren-Wei Liang
Chin-Ting Hsu
Chan-Hung Yu
Saransh Agrawal
Shih-Cheng Huang
Shang-Tse Chen
Kuan-Hao Huang
Shao-Hua Sun
319
0
0
27 Apr 2025
Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
Kesen Zhao
B. Zhu
Qianru Sun
Hanwang Zhang
MLLM, LRM
423
13
0
25 Apr 2025
POPri: Private Federated Learning using Preference-Optimized Synthetic Data
Charlie Hou
Mei-Yu Wang
Yige Zhu
Daniel Lazar
Giulia Fanti
FedML
525
7
0
23 Apr 2025
Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model
Junshu Pan
Wei Shen
Shulin Huang
Qiji Zhou
Yue Zhang
332
5
0
22 Apr 2025
Direct Advantage Regression: Aligning LLMs with Online AI Reward
Li He
He Zhao
Stephen Wan
Dadong Wang
Lina Yao
Tongliang Liu
292
0
0
19 Apr 2025
Energy-Based Reward Models for Robust Language Model Alignment
Anamika Lochab
Ruqi Zhang
972
2
0
17 Apr 2025
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
Haojian Huang
Haodong Chen
Shengqiong Wu
Meng Luo
Jinlan Fu
Xinya Du
Hao Zhang
Hao Fei
AI4TS
970
8
0
17 Apr 2025
REWARD CONSISTENCY: Improving Multi-Objective Alignment from a Data-Centric Perspective
Zhihao Xu
Yongqi Tong
Xin Zhang
Jun Zhou
Xiting Wang
213
1
0
15 Apr 2025
Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data
Shuai Zhao
Linchao Zhu
Yi Yang
448
4
0
14 Apr 2025
SaRO: Enhancing LLM Safety through Reasoning-based Alignment
Yutao Mou
Yuxiao Luo
Shikun Zhang
Wei Ye
LLMSV, LRM
268
14
0
13 Apr 2025
Discriminator-Free Direct Preference Optimization for Video Diffusion
Haoran Cheng
Qide Dong
Liang Peng
Zhizhou Sha
Weiguo Feng
Jinghui Xie
Zhao Song
Shilei Wen
Xiaofei He
Boxi Wu
VGen
846
2
0
11 Apr 2025
Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
FangZhi Xu
Hang Yan
Chang Ma
Haiteng Zhao
Qiushi Sun
Kanzhi Cheng
Junxian He
Jun Liu
Zhiyong Wu
LRM
221
15
0
11 Apr 2025
2D-Curri-DPO: Two-Dimensional Curriculum Learning for Direct Preference Optimization
Mengyang Li
Zhong Zhang
386
2
0
10 Apr 2025
CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization
Jing Yao
Xiaoyuan Yi
Jindong Wang
Zhicheng Dou
Xing Xie
172
4
0
09 Apr 2025
Bridging the Gap Between Preference Alignment and Machine Unlearning
Xiaohua Feng
Yuyuan Li
Huwei Ji
Jiaming Zhang
Lulu Zhang
Xuhong Zhang
Chaochao Chen
MU
258
2
0
09 Apr 2025
Information-Theoretic Reward Decomposition for Generalizable RLHF
Liyuan Mao
Haoran Xu
Amy Zhang
Weinan Zhang
Fuchun Sun
353
3
0
08 Apr 2025
Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use
Anna Goldie
Azalia Mirhoseini
Hao Zhou
Irene Cai
Christopher D. Manning
SyDa, OffRL, ReLM, LRM
467
28
0
07 Apr 2025
Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning
Xuerui Su
Shufang Xie
Lingbo Li
Ziheng Lu
Renqian Luo
Peiran Jin
Zhiming Ma
Yue Wang
Guoqing Liu
Yuting Liu
LRM
454
8
0
06 Apr 2025
Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning
Kai Ye
Hongyi Zhou
Jin Zhu
Francesco Quinzan
C. Shi
477
5
0
03 Apr 2025
The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context
Nikhil Verma
Manasa Bharadwaj
266
3
0
03 Apr 2025
More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment
Yifan Wang
Runjin Chen
Bolian Li
David Cho
Yihe Deng
Ruqi Zhang
Tianlong Chen
Zhangyang Wang
A. Grama
Junyuan Hong
SyDa
279
4
0
03 Apr 2025
AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization
Chaohu Liu
Tianyi Gui
Yu Liu
Linli Xu
VLM, AAML
338
3
0
02 Apr 2025
Entropy-Based Adaptive Weighting for Self-Training
Xiaoxuan Wang
Yihe Deng
Mingyu Derek Ma
Wei Wang
LRM
227
1
0
31 Mar 2025
Learning a Canonical Basis of Human Preferences from Binary Ratings
Kailas Vodrahalli
Wei Wei
James Zou
239
1
0
31 Mar 2025
GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zhouhong Gu
Xingzhou Chen
Xiaoran Shi
Tao Wang
Suhang Zheng
Tianyu Li
Hongwei Feng
Yanghua Xiao
319
1
0
26 Mar 2025
Reasoning Beyond Limits: Advances and Open Problems for LLMs
ICT Express, 2025
M. Ferrag
Norbert Tihanyi
Merouane Debbah
OffRL, LRM, AI4CE, ELM
811
18
0
26 Mar 2025
RL-finetuning LLMs from on- and off-policy data with a single algorithm
Yunhao Tang
Taco Cohen
David W. Zhang
Michal Valko
Rémi Munos
OffRL
370
9
0
25 Mar 2025
Latent Embedding Adaptation for Human Preference Alignment in Diffusion Planners
IEEE International Conference on Robotics and Automation (ICRA), 2025
Wen Zheng Terence Ng
Jianda Chen
Yuan Xu
Tianwei Zhang
370
0
0
24 Mar 2025
Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training
Brian Bartoldson
S. Venkatraman
James Diffenderfer
Moksh Jain
Tal Ben-Nun
Seanie Lee
Minsu Kim
J. Obando-Ceron
Yoshua Bengio
B. Kailkhura
OffRL
313
11
0
24 Mar 2025
InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment
Computer Vision and Pattern Recognition (CVPR), 2025
Yaojie Lu
Qichao Wang
H. Cao
Xierui Wang
Xiaoyin Xu
Min Zhang
327
8
0
24 Mar 2025
Enhancing Persona Consistency for LLMs' Role-Playing using Persona-Aware Contrastive Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Ke Ji
Yixin Lian
Linxu Li
Jingsheng Gao
Weiyuan Li
Bin Dai
254
14
0
22 Mar 2025
Capturing Individual Human Preferences with Reward Features
Capturing Individual Human Preferences with Reward Features
André Barreto
Vincent Dumoulin
Yiran Mao
Nicolas Perez-Nieves
Bobak Shahriari
Yann Dauphin
Doina Precup
Hugo Larochelle
ALM
246
4
0
21 Mar 2025
When Preferences Diverge: Aligning Diffusion Models with Minority-Aware Adaptive DPO
When Preferences Diverge: Aligning Diffusion Models with Minority-Aware Adaptive DPO
Guang Dai
Chen Liu
C. Xu
Kai Hu
Donghao Luo
Chengjie Wang
Yanwei Fu
Xingtai Lv
261
1
0
21 Mar 2025
InCo-DPO: Balancing Distribution Shift and Data Quality for Enhanced Preference Optimization
InCo-DPO: Balancing Distribution Shift and Data Quality for Enhanced Preference Optimization
Yunan Wang
Jijie Li
Bo Zhang
Liangdong Wang
Guang Liu
234
3
0
20 Mar 2025
Cultural Alignment in Large Language Models Using Soft Prompt Tuning
Cultural Alignment in Large Language Models Using Soft Prompt Tuning
Reem I. Masoud
Martin Ferianc
Philip C. Treleaven
Miguel R. D. Rodrigues
ALM
241
3
0
20 Mar 2025