A General Theoretical Paradigm to Understand Learning from Human Preferences (arXiv:2310.12036)
M. G. Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos
18 October 2023

Papers citing "A General Theoretical Paradigm to Understand Learning from Human Preferences"

Showing 50 of 415 citing papers.

Preference Optimization for Combinatorial Optimization Problems
Mingjun Pan, Guanquan Lin, You-Wei Luo, Bin Zhu, Zhien Dai, Lijun Sun, Chun Yuan
13 May 2025

InfoPO: On Mutual Information Maximization for Large Language Model Alignment
Teng Xiao, Zhen Ge, Sujay Sanghavi, Tian Wang, Julian Katz-Samuels, Marc Versage, Qingjun Cui, Trishul M. Chilimbi
13 May 2025

Direct Density Ratio Optimization: A Statistically Consistent Approach to Aligning Large Language Models
Rei Higuchi, Taiji Suzuki
12 May 2025

Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes
Zhuocheng Gong, Jian-Yu Guan, Wei Yu Wu, Huishuai Zhang, Dongyan Zhao
08 May 2025

Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL
Jiarui Yao, Yifan Hao, Hanning Zhang, Hanze Dong, Wei Xiong, Nan Jiang, Tong Zhang
Communities: LRM
05 May 2025

A Survey on Progress in LLM Alignment from the Perspective of Reward Design
Miaomiao Ji, Yanqiu Wu, Zhibin Wu, Shoujin Wang, Jian Yang, Mark Dras, Usman Naseem
05 May 2025

SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
Tianjian Li, Daniel Khashabi
05 May 2025

Improving Model Alignment Through Collective Intelligence of Open-Source LLMs
Junlin Wang, Roy Xie, Shang Zhu, Jue Wang, Ben Athiwaratkun, Bhuwan Dhingra, S. Song, Ce Zhang, James Y. Zou
Communities: ALM
05 May 2025

R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation
Meng-Hao Guo, Jiajun Xu, Yi Zhang, Jiaxi Song, Haoyang Peng, ..., Yongming Rao, Houwen Peng, Han Hu, Gordon Wetzstein, Shi-Min Hu
Communities: ELM, LRM
04 May 2025

Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach
Jiancong Xiao, Bojian Hou, Zhanliang Wang, Ruochen Jin, Q. Long, Weijie Su, Li Shen
04 May 2025

HyPerAlign: Hypotheses-driven Personalized Alignment
Cristina Garbacea, Chenhao Tan
29 Apr 2025

Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors
Ren-Wei Liang, Chin-Ting Hsu, Chan-Hung Yu, Saransh Agrawal, Shih-Cheng Huang, Shang-Tse Chen, Kuan-Hao Huang, Shao-Hua Sun
27 Apr 2025

Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
Kesen Zhao, B. Zhu, Qianru Sun, Hanwang Zhang
Communities: MLLM, LRM
25 Apr 2025

Private Federated Learning using Preference-Optimized Synthetic Data
Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti
Communities: FedML
Presented at ResearchTrend Connect | FedML on 07 May 2025
23 Apr 2025

Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model
Junshu Pan, Wei Shen, Shulin Huang, Qiji Zhou, Yue Zhang
22 Apr 2025

Direct Advantage Regression: Aligning LLMs with Online AI Reward
Li He, He Zhao, Stephen Wan, Dadong Wang, Lina Yao, Tongliang Liu
19 Apr 2025

VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
Haojian Huang, Haodong Chen, Shengqiong Wu, Meng Luo, Jinlan Fu, Xinya Du, H. Zhang, Hao Fei
Communities: AI4TS
17 Apr 2025

Energy-Based Reward Models for Robust Language Model Alignment
Anamika Lochab, Ruqi Zhang
17 Apr 2025

REWARD CONSISTENCY: Improving Multi-Objective Alignment from a Data-Centric Perspective
Zhihao Xu, Yongqi Tong, Xin Zhang, Jun Zhou, Xiting Wang
15 Apr 2025

Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data
Shuai Zhao, Linchao Zhu, Yi Yang
14 Apr 2025

SaRO: Enhancing LLM Safety through Reasoning-based Alignment
Yutao Mou, Yuxiao Luo, Shikun Zhang, Wei Ye
Communities: LLMSV, LRM
13 Apr 2025

Discriminator-Free Direct Preference Optimization for Video Diffusion
Haoran Cheng, Qide Dong, Liang Peng, Zhizhou Sha, Weiguo Feng, Jinghui Xie, Zhao Song, Shilei Wen, Xiaofei He, Boxi Wu
Communities: VGen
11 Apr 2025

Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
FangZhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, Zhiyong Wu
Communities: LRM
11 Apr 2025

2D-Curri-DPO: Two-Dimensional Curriculum Learning for Direct Preference Optimization
Mengyang Li, Zhong Zhang
10 Apr 2025

Bridging the Gap Between Preference Alignment and Machine Unlearning
Xiaohua Feng, Yuyuan Li, Huwei Ji, Jiaming Zhang, L. Zhang, Tianyu Du, Chaochao Chen
Communities: MU
09 Apr 2025

CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization
Jing Yao, Xiaoyuan Yi, Jindong Wang, Zhicheng Dou, Xing Xie
09 Apr 2025

Information-Theoretic Reward Decomposition for Generalizable RLHF
Liyuan Mao, Haoran Xu, Amy Zhang, Weinan Zhang, Chenjia Bai
08 Apr 2025

Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use
Anna Goldie, Azalia Mirhoseini, Hao Zhou, Irene Cai, Christopher D. Manning
Communities: SyDa, OffRL, ReLM, LRM
07 Apr 2025

Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning
Xuerui Su, Shufang Xie, Guoqing Liu, Yingce Xia, Renqian Luo, Peiran Jin, Zhiming Ma, Yue Wang, Zun Wang, Yuting Liu
Communities: LRM
06 Apr 2025

Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning
Kai Ye, Hongyi Zhou, Jin Zhu, Francesco Quinzan, C. Shi
03 Apr 2025

The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual Context
Nikhil Verma, Manasa Bharadwaj
03 Apr 2025

AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization
Chaohu Liu, Tianyi Gui, Yu Liu, Linli Xu
Communities: VLM, AAML
02 Apr 2025

Entropy-Based Adaptive Weighting for Self-Training
Xiaoxuan Wang, Yihe Deng, Mingyu Derek Ma, Wei Wang
Communities: LRM
31 Mar 2025

Learning a Canonical Basis of Human Preferences from Binary Ratings
Kailas Vodrahalli, Wei Wei, James Y. Zou
31 Mar 2025

GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization
Zhouhong Gu, Xingzhou Chen, Xiaoran Shi, Tao Wang, Suhang Zheng, Tianyu Li, Hongwei Feng, Yanghua Xiao
26 Mar 2025

Reasoning Beyond Limits: Advances and Open Problems for LLMs
M. Ferrag, Norbert Tihanyi, Merouane Debbah
Communities: ELM, OffRL, LRM, AI4CE
26 Mar 2025

RL-finetuning LLMs from on- and off-policy data with a single algorithm
Yunhao Tang, Taco Cohen, David W. Zhang, Michal Valko, Rémi Munos
Communities: OffRL
25 Mar 2025

Latent Embedding Adaptation for Human Preference Alignment in Diffusion Planners
Wen Zheng Terence Ng, Jianda Chen, Yuan Xu, Tianwei Zhang
24 Mar 2025

InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment
Y. Lu, Qichao Wang, H. Cao, Xierui Wang, Xiaoyin Xu, Min Zhang
24 Mar 2025

Enhancing Persona Consistency for LLMs' Role-Playing using Persona-Aware Contrastive Learning
Ke Ji, Yixin Lian, Linxu Li, Jingsheng Gao, Weiyuan Li, Bin Dai
22 Mar 2025

Capturing Individual Human Preferences with Reward Features
André Barreto, Vincent Dumoulin, Yiran Mao, Nicolas Perez-Nieves, Bobak Shahriari, Yann Dauphin, Doina Precup, Hugo Larochelle
Communities: ALM
21 Mar 2025

When Preferences Diverge: Aligning Diffusion Models with Minority-Aware Adaptive DPO
L. Zhang, Chen Liu, C. Xu, Kai Hu, Donghao Luo, Chengjie Wang, Yanwei Fu, Yuan Yao
21 Mar 2025

Cultural Alignment in Large Language Models Using Soft Prompt Tuning
Reem I. Masoud, Martin Ferianc, Philip C. Treleaven, Miguel R. D. Rodrigues
Communities: ALM
20 Mar 2025

InCo-DPO: Balancing Distribution Shift and Data Quality for Enhanced Preference Optimization
Yunan Wang, Jijie Li, Bo Zhang, Liangdong Wang, Guang Liu
20 Mar 2025

Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs
Nicolas Le Roux, Marc G. Bellemare, Jonathan Lebensold, Arnaud Bergeron, Joshua Greaves, Alex Fréchette, Carolyne Pelletier, Eric Thibodeau-Laufer, Sándor Toth, Sam Work
Communities: OffRL
18 Mar 2025

Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation
Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, ..., Nan Xu, Wei He, Xiangyuan Lan, D. Jiang, Dongbin Zhao
Communities: LRM
17 Mar 2025

Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model
Qiyuan Deng, X. Bai, Kehai Chen, Yaowei Wang, Liqiang Nie, Min Zhang
Communities: OffRL
13 Mar 2025

Preference-Based Alignment of Discrete Diffusion Models
Umberto Borso, Davide Paglieri, Jude Wells, Tim Rocktaschel
11 Mar 2025

Robust Multi-Objective Controlled Decoding of Large Language Models
Seongho Son, William Bankes, Sangwoong Yoon, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic
11 Mar 2025

RePO: ReLU-based Preference Optimization
Junkang Wu, Kexin Huang, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, X. Wang
10 Mar 2025