ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2407.00617
  4. Cited By
Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

30 June 2024
Yuheng Zhang
Dian Yu
Baolin Peng
Linfeng Song
Ye Tian
Mingyue Huo
Nan Jiang
Haitao Mi
Dong Yu
ArXivPDFHTML

Papers citing "Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning"

13 / 13 papers shown
Title
Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning
Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning
Kai Ye
Hongyi Zhou
Jin Zhu
Francesco Quinzan
C. Shi
20
0
0
03 Apr 2025
Stackelberg Game Preference Optimization for Data-Efficient Alignment of Language Models
Stackelberg Game Preference Optimization for Data-Efficient Alignment of Language Models
Xu Chu
Zhixin Zhang
Tianyu Jia
Yujie Jin
72
0
0
25 Feb 2025
Improving LLM General Preference Alignment via Optimistic Online Mirror Descent
Improving LLM General Preference Alignment via Optimistic Online Mirror Descent
Yuheng Zhang
Dian Yu
Tao Ge
Linfeng Song
Zhichen Zeng
Haitao Mi
Nan Jiang
Dong Yu
56
1
0
24 Feb 2025
Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment
Faster WIND: Accelerating Iterative Best-of-NNN Distillation for LLM Alignment
Tong Yang
Jincheng Mei
H. Dai
Zixin Wen
Shicong Cen
Dale Schuurmans
Yuejie Chi
Bo Dai
36
4
0
20 Feb 2025
Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees
Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees
Yongtao Wu
Luca Viano
Yihang Chen
Zhenyu Zhu
Kimon Antonakopoulos
Quanquan Gu
V. Cevher
49
0
0
18 Feb 2025
Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Model Alignment
Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Model Alignment
Mingzhi Wang
Chengdong Ma
Qizhi Chen
Linjian Meng
Yang Han
Jiancong Xiao
Zhaowei Zhang
Jing Huo
Weijie Su
Yaodong Yang
30
4
0
22 Oct 2024
Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization
Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization
Guanlin Liu
Kaixuan Ji
Ning Dai
Zheng Wu
Chen Dun
Q. Gu
Lin Yan
Quanquan Gu
Lin Yan
OffRL
LRM
46
8
0
11 Oct 2024
Direct Nash Optimization: Teaching Language Models to Self-Improve with
  General Preferences
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Corby Rosset
Ching-An Cheng
Arindam Mitra
Michael Santacroce
Ahmed Hassan Awadallah
Tengyang Xie
144
113
0
04 Apr 2024
RewardBench: Evaluating Reward Models for Language Modeling
RewardBench: Evaluating Reward Models for Language Modeling
Nathan Lambert
Valentina Pyatkin
Jacob Morrison
Lester James Validad Miranda
Bill Yuchen Lin
...
Sachin Kumar
Tom Zick
Yejin Choi
Noah A. Smith
Hanna Hajishirzi
ALM
68
210
0
20 Mar 2024
Stabilizing RLHF through Advantage Model and Selective Rehearsal
Stabilizing RLHF through Advantage Model and Selective Rehearsal
Baolin Peng
Linfeng Song
Ye Tian
Lifeng Jin
Haitao Mi
Dong Yu
25
17
0
18 Sep 2023
Efficient Phi-Regret Minimization in Extensive-Form Games via Online
  Mirror Descent
Efficient Phi-Regret Minimization in Extensive-Form Games via Online Mirror Descent
Yu Bai
Chi Jin
Song Mei
Ziang Song
Tiancheng Yu
OffRL
46
18
0
30 May 2022
Training language models to follow instructions with human feedback
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
No-Regret Learning in Time-Varying Zero-Sum Games
No-Regret Learning in Time-Varying Zero-Sum Games
Mengxiao Zhang
Peng Zhao
Haipeng Luo
Zhi-Hua Zhou
21
38
0
30 Jan 2022
1