Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability

24 November 2021

Papers citing "Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability"


Title
Active Human Feedback Collection via Neural Contextual Dueling Bandits | Arun Verma, Xiaoqiang Lin, Zhongxiang Dai, Daniela Rus, Bryan Kian Hsiang Low | 42 0 0 | 16 Apr 2025
Online Clustering of Dueling Bandits | Zhiyong Wang, Jiahang Sun, Mingze Kong, Jize Xie, Qinghua Hu, J. C. Lui, Zhongxiang Dai | 85 0 0 | 04 Feb 2025
DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback | Guojun Xiong, Ujwal Dinesha, Debajoy Mukherjee, Jian Li, Srinivas Shakkottai | 52 2 0 | 07 Oct 2024
Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF | Zhaolin Gao, Wenhao Zhan, Jonathan D. Chang, Gokul Swamy, Kianté Brantley, Jason D. Lee, Wen Sun | OffRL 81 3 0 | 06 Oct 2024
Conversational Dueling Bandits in Generalized Linear Models | Shuhua Yang, Hui Yuan, Xiaoying Zhang, Mengdi Wang, Hong Zhang, Huazheng Wang | 41 1 0 | 26 Jul 2024
Neural Dueling Bandits: Preference-Based Optimization with Human Feedback | Arun Verma, Zhongxiang Dai, Xiaoqiang Lin, Patrick Jaillet, K. H. Low | 39 5 0 | 24 Jul 2024
Bandits with Preference Feedback: A Stackelberg Game Perspective | Barna Pásztor, Parnian Kassraie, Andreas Krause | 42 2 0 | 24 Jun 2024
Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis | Qining Zhang, Honghao Wei, Lei Ying | OffRL 67 1 0 | 11 Jun 2024
Self-Play with Adversarial Critic: Provable and Scalable Offline Alignment for Language Models | Xiang Ji, Sanjeev Kulkarni, Mengdi Wang, Tengyang Xie | OffRL 45 4 0 | 06 Jun 2024
The Power of Active Multi-Task Learning in Reinforcement Learning from Human Feedback | Ruitao Chen, Liwei Wang | 75 1 0 | 18 May 2024
Active Preference Learning for Ordering Items In- and Out-of-sample | Herman Bergström, Emil Carlsson, Devdatt Dubhashi, Fredrik D. Johansson | 49 0 0 | 05 May 2024
Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback | Qiwei Di, Jiafan He, Quanquan Gu | 31 1 0 | 16 Apr 2024
Feel-Good Thompson Sampling for Contextual Dueling Bandits | Xuheng Li, Heyang Zhao, Quanquan Gu | 44 9 0 | 09 Apr 2024
DP-Dueling: Learning from Preference Feedback without Compromising User Privacy | Aadirupa Saha, Hilal Asi | 36 1 0 | 22 Mar 2024
MENTOR: Guiding Hierarchical Reinforcement Learning with Human Feedback and Dynamic Distance Constraint | Xinglin Zhou, Yifu Yuan, Shaofu Yang, Jianye Hao | 39 1 0 | 22 Feb 2024
Reinforcement Learning from Human Feedback with Active Queries | Kaixuan Ji, Jiafan He, Quanquan Gu | 26 17 0 | 14 Feb 2024
Principled Preferential Bayesian Optimization | Wenjie Xu, Wenbin Wang, Yuning Jiang, B. Svetozarevic, Colin N. Jones | 32 6 0 | 08 Feb 2024
Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF | Banghua Zhu, Michael I. Jordan, Jiantao Jiao | 36 25 0 | 29 Jan 2024
A Minimaximalist Approach to Reinforcement Learning from Human Feedback | Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, Alekh Agarwal | OffRL 51 96 0 | 08 Jan 2024
Think Before You Duel: Understanding Complexities of Preference Learning under Constrained Resources | Rohan Deb, Aadirupa Saha | 38 0 0 | 28 Dec 2023
Faster Convergence with Multiway Preferences | Aadirupa Saha, Vitaly Feldman, Tomer Koren, Yishay Mansour | 28 1 0 | 19 Dec 2023
Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks | Zihao Li, Xiang Ji, Minshuo Chen, Mengdi Wang | OffRL 46 0 0 | 16 Oct 2023
Identifying Copeland Winners in Dueling Bandits with Indifferences | Viktor Bengs, Björn Haddenhorst, Eyke Hüllermeier | 53 0 0 | 01 Oct 2023
Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism | Zihao Li, Zhuoran Yang, Mengdi Wang | OffRL 39 55 0 | 29 May 2023
Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons | Banghua Zhu, Jiantao Jiao, Michael I. Jordan | OffRL 42 184 0 | 26 Jan 2023
One Arrow, Two Kills: An Unified Framework for Achieving Optimal Regret Guarantees in Sleeping Bandits | Pierre Gaillard, Aadirupa Saha, Soham Dan | 30 3 0 | 26 Oct 2022
ANACONDA: An Improved Dynamic Regret Algorithm for Adaptive Non-Stationary Dueling Bandits | Thomas Kleine Buening, Aadirupa Saha | 48 6 0 | 25 Oct 2022
Dueling Convex Optimization with General Preferences | Aadirupa Saha, Tomer Koren, Yishay Mansour | 30 2 0 | 27 Sep 2022
Flexible and Efficient Contextual Bandits with Heterogeneous Treatment Effect Oracles | Aldo G. Carranza, Sanath Kumar Krishnamurthy, Susan Athey | 21 1 0 | 30 Mar 2022
Versatile Dueling Bandits: Best-of-both-World Analyses for Online Learning from Preferences | Aadirupa Saha, Pierre Gaillard | 38 8 0 | 14 Feb 2022
Dueling RL: Reinforcement Learning with Trajectory Preferences | Aldo Pacchiano, Aadirupa Saha, Jonathan Lee | 35 82 0 | 08 Nov 2021
Optimal Dynamic Regret in Exp-Concave Online Learning | Dheeraj Baby, Yu Wang | 50 44 0 | 23 Apr 2021