Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism
Zihao Li, Zhuoran Yang, Mengdi Wang (29 May 2023) [OffRL]

Papers citing "Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism"

39 citing papers

On the Limitations of Steering in Language Model Alignment
Chebrolu Niranjan, Kokil Jaidka, G. Yeo (02 May 2025) [LLMSV]

Early Timestep Zero-Shot Candidate Selection for Instruction-Guided Image Editing
Joowon Kim, Ziseok Lee, Donghyeon Cho, Sanghyun Jo, Y. Jung, Kyungsu Kim, Eunho Yang (18 Apr 2025) [DiffM]

Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval
Kidist Amde Mekonnen, Yubao Tang, Maarten de Rijke (07 Apr 2025)

Generative Artificial Intelligence: Evolving Technology, Growing Societal Impact, and Opportunities for Information Systems Research
Veda C. Storey, Wei Thoo Yue, J. Leon Zhao, Roman Lukyanenko (25 Feb 2025)

Leveraging Constrained Monte Carlo Tree Search to Generate Reliable Long Chain-of-Thought for Mathematical Reasoning
Qingwen Lin, Boyan Xu, Zijian Li, Z. Hao, Keli Zhang, Ruichu Cai (16 Feb 2025) [LRM]

Delta - Contrastive Decoding Mitigates Text Hallucinations in Large Language Models
Cheng Peng Huang, Hao-Yuan Chen (09 Feb 2025) [HILM]

A Survey of Research in Large Language Models for Electronic Design Automation
Jingyu Pan, Guanglei Zhou, Chen-Chia Chang, Isaac Jacobson, Jiang Hu, Y. Chen (17 Jan 2025)

A Theoretical Survey on Foundation Models
Shi Fu, Yuzhu Chen, Yingjie Wang, Dacheng Tao (15 Oct 2024)

Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference
Qining Zhang, Lei Ying (25 Sep 2024) [OffRL]

SaulLM-54B & SaulLM-141B: Scaling Up Domain Adaptation for the Legal Domain
Pierre Colombo, T. Pires, Malik Boudiaf, Rui Melo, Dominic Culver, Sofia Morgado, Etienne Malaboeuf, Gabriel Hautreux, Johanne Charpentier, Michael Desa (28 Jul 2024) [ELM, AILaw, ALM]

A Teacher Is Worth A Million Instructions
Nikhil Kothari, Ravindra Nayak, Shreyas Shetty, Amey Patil, Nikesh Garera (27 Jun 2024) [ALM]

Uncertainty Aware Learning for Language Model Alignment
Yikun Wang, Rui Zheng, Liang Ding, Qi Zhang, Dahua Lin, Dacheng Tao (07 Jun 2024)

Cooperative learning of Pl@ntNet's Artificial Intelligence algorithm: how does it work and how can we improve it?
Tanguy Lefort, Antoine Affouard, Benjamin Charlier, J. Lombardo, Mathias Chouet, Hervé Goëau, Joseph Salmon, P. Bonnet, Alexis Joly (05 Jun 2024)

Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose H. Blanchet, Zhaoran Wang (26 May 2024)

A Unified Linear Programming Framework for Offline Reward Learning from Human Demonstrations and Feedback
Kihyun Kim, Jiawei Zhang, Asuman Ozdaglar, P. Parrilo (20 May 2024) [OffRL]

DPO Meets PPO: Reinforced Token Optimization for RLHF
Han Zhong, Guhao Feng, Li Zhao, Di He, Jiang Bian, Liwei Wang (29 Apr 2024)

Parameter Efficient Diverse Paraphrase Generation Using Sequence-Level Knowledge Distillation
Lasal Jayawardena, Prasan Yapa (19 Apr 2024) [BDL]

Dataset Reset Policy Optimization for RLHF
Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Kianté Brantley, Dipendra Kumar Misra, Jason D. Lee, Wen Sun (12 Apr 2024) [OffRL]

GPTA: Generative Prompt Tuning Assistant for Synergistic Downstream Neural Network Enhancement with LLMs
Xiao Liu, Jiawei Zhang (29 Mar 2024)

DP-Dueling: Learning from Preference Feedback without Compromising User Privacy
Aadirupa Saha, Hilal Asi (22 Mar 2024)

Diffusion Model for Data-Driven Black-Box Optimization
Zihao Li, Hui Yuan, Kaixuan Huang, Chengzhuo Ni, Yinyu Ye, Minshuo Chen, Mengdi Wang (20 Mar 2024) [DiffM]

FARPLS: A Feature-Augmented Robot Trajectory Preference Labeling System to Assist Human Labelers' Preference Elicitation
Hanfang Lyu, Yuanchen Bai, Xin Liang, Ujaan Das, Chuhan Shi, Leiliang Gong, Yingchi Li, Mingfei Sun, Ming Ge, Xiaojuan Ma (10 Mar 2024)

Large Language Models for Simultaneous Named Entity Extraction and Spelling Correction
Edward Whittaker, I. Kitagishi (01 Mar 2024)

Is Offline Decision Making Possible with Only Few Samples? Reliable Decisions in Data-Starved Bandits via Trust Region Enhancement
Ruiqi Zhang, Yuexiang Zhai, Andrea Zanette (24 Feb 2024)

MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences
Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang (14 Feb 2024) [ALM]

Online Iterative Reinforcement Learning from Human Feedback with General Preference Model
Chen Ye, Wei Xiong, Yuheng Zhang, Nan Jiang, Tong Zhang (11 Feb 2024) [OffRL]

SymbolicAI: A framework for logic-based approaches combining generative models and solvers
Marius-Constantin Dinu, Claudiu Leoveanu-Condrei, Markus Holzleitner, Werner Zellinger, Sepp Hochreiter (01 Feb 2024)

Think Before You Duel: Understanding Complexities of Preference Learning under Constrained Resources
Rohan Deb, Aadirupa Saha (28 Dec 2023)

Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint
Wei Xiong, Hanze Dong, Chen Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, Tong Zhang (18 Dec 2023) [OffRL]

Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis
Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan M Francis, Jay Patrikar, ..., Xiaolong Wang, Sebastian A. Scherer, Z. Kira, Fei Xia, Yonatan Bisk (14 Dec 2023) [LM&Ro, AI4CE]

Beyond Text: Unveiling Multimodal Proficiency of Large Language Models with MultiAPI Benchmark
Xiao Liu, Jianfeng Lin, Jiawei Zhang (21 Nov 2023)

Personalised Distillation: Empowering Open-Sourced LLMs with Adaptive Learning for Code Generation
Hailin Chen, Amrita Saha, Steven C. H. Hoi, Shafiq R. Joty (28 Oct 2023)

Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks
Zihao Li, Xiang Ji, Minshuo Chen, Mengdi Wang (16 Oct 2023) [OffRL]

On the Provable Advantage of Unsupervised Pretraining
Jiawei Ge, Shange Tang, Jianqing Fan, Chi Jin (02 Mar 2023) [SSL]

Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, ..., John F. J. Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, G. Irving (28 Sep 2022) [ALM, AAML]

Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation
Xiaoyu Chen, Han Zhong, Zhuoran Yang, Zhaoran Wang, Liwei Wang (23 May 2022)

Teaching language models to support answers with verified quotes
Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, ..., Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, G. Irving, Nat McAleese (21 Mar 2022) [ELM, RALM]

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe (04 Mar 2022) [OSLM, ALM]

Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage
Masatoshi Uehara, Wen Sun (13 Jul 2021) [OffRL]