ResearchTrend.AI
Self-Play Preference Optimization for Language Model Alignment
arXiv:2405.00675 (v4, latest)

1 May 2024
Yue Wu
Zhiqing Sun
Huizhuo Yuan
Kaixuan Ji
Yiming Yang
Quanquan Gu
ArXiv (abs) · PDF · HTML · HuggingFace (28 upvotes) · GitHub (560★)

Papers citing "Self-Play Preference Optimization for Language Model Alignment"

50 / 124 papers shown
MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs
Huining Yuan
Zelai Xu
Zheyue Tan
Xiangmin Yi
Mo Guang
...
Xinlei Chen
Bo Zhao
Xiao-Ping Zhang
Chao Yu
Yu Wang
LLMAG · LRM
89
0
0
17 Oct 2025
Generative AI for Biosciences: Emerging Threats and Roadmap to Biosecurity
Zaixi Zhang
Souradip Chakraborty
Amrit Singh Bedi
Emilin Mathew
Varsha Saravanan
...
Eric Xing
R. Altman
George Church
M. Y. Wang
Mengdi Wang
SILM
307
0
0
13 Oct 2025
Towards Self-Refinement of Vision-Language Models with Triangular Consistency
Yunlong Deng
Guangyi Chen
Tianpei Gu
Lingjing Kong
Yan Li
Zeyu Tang
Kun Zhang
104
1
0
12 Oct 2025
CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension
Rui Li
Zeyu Zhang
Xiaohe Bo
Zihang Tian
Xu Chen
Quanyu Dai
Zhenhua Dong
Ruiming Tang
RALM
120
0
0
07 Oct 2025
Improving Sampling Efficiency in RLVR through Adaptive Rollout and Response Reuse
Yuheng Zhang
Wenlin Yao
Changlong Yu
Yao Liu
Qingyu Yin
Bing Yin
Hyokun Yun
Lihong Li
73
0
0
30 Sep 2025
Multiplayer Nash Preference Optimization
Fang Wu
X. Y. Huang
Weihao Xuan
Zhiwei Zhang
Yijia Xiao
...
Xiaomin Li
Bing Hu
Peng Xia
Jure Leskovec
Yejin Choi
84
1
0
27 Sep 2025
Cognition-of-Thought Elicits Social-Aligned Reasoning in Large Language Models
Xuanming Zhang
Yuxuan Chen
Min-Hsuan Yeh
Yixuan Li
LRM
168
2
0
27 Sep 2025
Agentic Reinforcement Learning with Implicit Step Rewards
Xiaoqian Liu
Ke Wang
Yuchuan Wu
Fei Huang
Y. Li
Junge Zhang
Jianbin Jiao
OffRL
118
0
0
23 Sep 2025
Language Self-Play For Data-Free Training
Jakub Grudzien Kuba
Mengting Gu
Qi Ma
Yuandong Tian
Vijai Mohan
SyDa
161
11
0
09 Sep 2025
SPFT-SQL: Enhancing Large Language Model for Text-to-SQL Parsing by Self-Play Fine-Tuning
Yuhao Zhang
Shaoming Duan
Jinhang Su
Chuanyi Liu
Peiyi Han
SyDa
106
0
0
04 Sep 2025
Understanding Reinforcement Learning for Model Training, and future directions with GRAPE
Rohit Patel
OffRL
148
0
0
02 Sep 2025
Improving Large Vision and Language Models by Learning from a Panel of Peers
J. Hernandez
Jing Shi
Simon Jenni
Vicente Ordonez
Kushal Kafle
80
1
0
01 Sep 2025
Diversity First, Quality Later: A Two-Stage Assumption for Language Model Alignment
Zetian Sun
Dongfang Li
Baotian Hu
48
0
0
14 Aug 2025
Phi-Ground Tech Report: Advancing Perception in GUI Grounding
Miaosen Zhang
Ziqiang Xu
Jialiang Zhu
Qi Dai
Kai Qiu
...
Chong Luo
Tianyi Chen
Justin Wagle
Tim Franklin
Baining Guo
LRM
144
8
0
31 Jul 2025
The Hidden Link Between RLHF and Contrastive Learning
Xufei Lv
Kehai Chen
Haoyuan Sun
X. Bai
Min Zhang
Houde Liu
Kehai Chen
130
2
0
27 Jun 2025
Rethinking DPO: The Role of Rejected Responses in Preference Misalignment
Jay Hyeon Cho
JunHyeok Oh
Myunsoo Kim
Byung-Jun Lee
166
3
0
15 Jun 2025
Reinforce LLM Reasoning through Multi-Agent Reflection
Yurun Yuan
Tengyang Xie
LRM
189
16
0
10 Jun 2025
Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models
Mickel Liu
L. Jiang
Yancheng Liang
S. Du
Yejin Choi
Tim Althoff
Natasha Jaques
AAML · LRM
207
11
0
09 Jun 2025
Debiasing Online Preference Learning via Preference Feature Preservation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Dongyoung Kim
Jinsung Yoon
Jinwoo Shin
Jaehyung Kim
142
0
0
06 Jun 2025
Efficient Online RFT with Plug-and-Play LLM Judges: Unlocking State-of-the-Art Performance
Rudransh Agnihotri
Ananya Pandey
OffRL · ALM
175
1
0
06 Jun 2025
Reshaping Reasoning in LLMs: A Theoretical Analysis of RL Training Dynamics through Pattern Selection
Xingwu Chen
Tianle Li
Difan Zou
LRM
312
1
0
05 Jun 2025
SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat
Yuru Jiang
Wenxuan Ding
Shangbin Feng
Greg Durrett
Yulia Tsvetkov
221
2
0
05 Jun 2025
Doubly Robust Alignment for Large Language Models
Erhan Xu
Kai Ye
Hongyi Zhou
Luhan Zhu
Francesco Quinzan
Chengchun Shi
256
2
0
01 Jun 2025
Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?
Paul Gölz
Nika Haghtalab
Kunhe Yang
147
6
0
29 May 2025
Fundamental Limits of Game-Theoretic LLM Alignment: Smith Consistency and Preference Matching
Zhekun Shi
Kaizhao Liu
Qi Long
Weijie J. Su
Jiancong Xiao
139
6
0
27 May 2025
Square$χ$PO: Differentially Private and Robust $χ^2$-Preference Optimization in Offline Direct Alignment
Xingyu Zhou
Yulian Wu
Wenqian Weng
Francesco Orabona
273
0
0
27 May 2025
Accelerating RL for LLM Reasoning with Optimal Advantage Regression
Kianté Brantley
Mingyu Chen
Zhaolin Gao
Jason D. Lee
Wen Sun
Wenhao Zhan
Xuezhou Zhang
OffRL · LRM
213
9
0
27 May 2025
Lifelong Safety Alignment for Language Models
Haoyu Wang
Zeyu Qin
Yifei Zhao
C. Du
Min Lin
Xueqian Wang
Tianyu Pang
KELM · CLL
219
5
0
26 May 2025
Accelerating Nash Learning from Human Feedback via Mirror Prox
D. Tiapkin
Daniele Calandriello
Denis Belomestny
Eric Moulines
Alexey Naumov
Kashif Rasul
Michal Valko
Pierre Ménard
166
2
0
26 May 2025
Large Language Models for Planning: A Comprehensive and Systematic Survey
Pengfei Cao
Tianyi Men
Wencan Liu
Jingwen Zhang
Xuzhao Li
Xixun Lin
Dianbo Sui
Yanan Cao
Kang Liu
Jun Zhao
LLMAG · LM&Ro · OffRL · ELM · LRM
323
11
0
26 May 2025
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization
Yunxin Li
Xinyu Chen
Zitao Li
Zhenyu Liu
L. Wang
Tong Lu
Baotian Hu
Min Zhang
OffRL · LRM
351
7
0
25 May 2025
Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Meng Li
Guangda Huzhang
Haibo Zhang
Xiting Wang
Anxiang Zeng
175
1
0
24 May 2025
From Evaluation to Defense: Advancing Safety in Video Large Language Models
Yiwei Sun
Peiqi Jiang
Chuanbin Liu
Luohao Lin
Zhiying Lu
Hongtao Xie
165
1
0
22 May 2025
MPO: Multilingual Safety Alignment via Reward Gap Optimization
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Weixiang Zhao
Yulin Hu
Yang Deng
Tongtong Wu
Wenxuan Zhang
...
An Zhang
Yanyan Zhao
Bing Qin
Tat-Seng Chua
Ting Liu
236
6
0
22 May 2025
Mutual-Taught for Co-adapting Policy and Reward Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Tianyuan Shi
Canbin Huang
Fanqi Wan
Longguang Zhong
Ziyi Yang
Weizhou Shen
Xiaojun Quan
Ming Yan
214
1
0
17 May 2025
Direct Density Ratio Optimization: A Statistically Consistent Approach to Aligning Large Language Models
Rei Higuchi
Taiji Suzuki
291
2
0
12 May 2025
WebThinker: Empowering Large Reasoning Models with Deep Research Capability
Xiaochen Li
Jiajie Jin
Guanting Dong
Hongjin Qian
Yutao Zhu
Yongkang Wu
Ji-Rong Wen
Zhicheng Dou
LLMAG · LRM
379
141
0
30 Apr 2025
Anyprefer: An Agentic Framework for Preference Data Synthesis
International Conference on Learning Representations (ICLR), 2025
Yiyang Zhou
Zhaoxiang Wang
Tianle Wang
Shangyu Xing
Peng Xia
...
Chetan Bansal
Weitong Zhang
Ying Wei
Joey Tianyi Zhou
Huaxiu Yao
348
10
0
27 Apr 2025
SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning
Jiaqi Chen
Bang Zhang
Ruotian Ma
Peisong Wang
Xiaodan Liang
Zhaopeng Tu
Xuzhao Li
Kwan-Yee K. Wong
LLMAG · ReLM · LRM
338
13
0
27 Apr 2025
Prejudge-Before-Think: Enhancing Large Language Models at Test-Time by Process Prejudge Reasoning
Jinqiao Wang
Jin Jiang
Yang Liu
Hao Fei
Xunliang Cai
LRM
222
0
0
18 Apr 2025
Enhancing Reasoning Abilities of Small LLMs with Cognitive Alignment
Wenrui Cai
Chengyu Wang
Junbing Yan
Jun Huang
Xiangzhong Fang
LRM
222
2
0
14 Apr 2025
A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future
Jialun Zhong
Wei Shen
Yanzeng Li
Songyang Gao
Hua Lu
Yicheng Chen
Yang Zhang
Wei Zhou
Jinjie Gu
Lei Zou
LRM
272
26
0
12 Apr 2025
Bridging the Gap Between Preference Alignment and Machine Unlearning
Xiaohua Feng
Yuyuan Li
Huwei Ji
Jiaming Zhang
Lulu Zhang
Xuhong Zhang
Chaochao Chen
MU
170
2
0
09 Apr 2025
Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning
Kai Ye
Hongyi Zhou
Jin Zhu
Francesco Quinzan
C. Shi
327
4
0
03 Apr 2025
Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization
Computer Vision and Pattern Recognition (CVPR), 2025
Zefeng Zhang
Hengzhu Tang
Shuaiyi Nie
Ying Tai
Yiming Ren
Zhenyang Li
Dawei Yin
Duohe Ma
Tingwen Liu
236
7
0
23 Mar 2025
Robust Multi-Objective Controlled Decoding of Large Language Models
Seongho Son
William Bankes
Sangwoong Yoon
Shyam Sundhar Ramesh
Xiaohang Tang
Ilija Bogunovic
260
5
0
11 Mar 2025
RePO: Understanding Preference Learning Through ReLU-Based Optimization
Junkang Wu
Kexin Huang
Qingsong Wen
Jinyang Gao
Bolin Ding
Jiancan Wu
Xiangnan He
Xiang Wang
225
3
0
10 Mar 2025
SHAPE : Self-Improved Visual Preference Alignment by Iteratively Generating Holistic Winner
Kejia Chen
Jiawen Zhang
Jiacong Hu
Jiazhen Yang
Jian Lou
Zunlei Feng
Weilong Dai
268
1
0
06 Mar 2025
Preserving Cultural Identity with Context-Aware Translation Through Multi-Agent AI Systems
Mahfuz Ahmed Anik
Abdur Rahman
Azmine Toushik Wasi
Md Manjurul Ahsan
279
13
0
05 Mar 2025
AMPO: Active Multi-Preference Optimization for Self-play Preference Selection
Taneesh Gupta
Rahul Madhavan
Xuchao Zhang
Chetan Bansal
Saravan Rajmohan
262
0
0
25 Feb 2025