arXiv:1502.06362

Contextual Dueling Bandits


23 February 2015

Miroslav Dudík

Robert Schapire

Aleksandrs Slivkins


Papers citing "Contextual Dueling Bandits"

50 / 96 papers shown

Offline Clustering of Preference Learning with Active-data Augmentation


Fatemeh Ghaffari

Mohammad Hajiesmaili

Carlee Joe-Wong

280

0

0

30 Oct 2025

Greedy Sampling Is Provably Efficient for RLHF


148

2

0

28 Oct 2025

Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options


232

0

0

21 Oct 2025

Online Mixture of Experts: No-Regret Learning for Optimal Collective Decision-Making


206

0

0

19 Oct 2025

A-IPO: Adaptive Intent-driven Preference Optimization


Muhammad Asif Ali

144

1

0

11 Oct 2025

Recycling History: Efficient Recommendations from Contextual Dueling Bandits


Suryanarayana Sankagiri

Matthias Grossglauser

161

0

0

26 Aug 2025

CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment


599

2

0

04 Aug 2025

Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization


434

2

0

08 Jun 2025

Neural Variance-aware Dueling Bandits with Deep Representation and Shallow Exploration


273

1

0

02 Jun 2025

Bayesian Optimization from Human Feedback: Near-Optimal Regret Bounds


215

1

0

29 May 2025

Proximal Point Nash Learning from Human Feedback


Daniele Calandriello

Denis Belomestny

274

4

0

26 May 2025

Sample Complexity of Identifying the Nonredundancy of Nontransitive Games in Dueling Bandits


328

0

0

08 May 2025

Toward Efficient Exploration by Large Language Model Agents


Thomas L. Griffiths

473

12

0

29 Apr 2025

Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization


Emiliano Penaloza

Tianyue H. Zhan

Laurent Charlin

Mateo Espinosa Zarlenga

711

4

0

25 Apr 2025

Reinforcement Learning from Multi-level and Episodic Human Feedback

Conference on Learning for Dynamics & Control (L4DC), 2025

Muhammad Qasim Elahi

Somtochukwu Oguchienti

Maheed H. Ahmed

600

0

0

20 Apr 2025

VARP: Reinforcement Learning from Vision-Language Model Feedback with Agent Regularized Preferences


Souradip Chakraborty

Amrit Singh Bedi

387

7

0

18 Mar 2025

Cost-Aware Optimal Pairwise Pure Exploration

International Conference on Artificial Intelligence and Statistics (AISTATS), 2025

313

0

0

10 Mar 2025

Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits

495

0

0

09 Feb 2025

Sharp Analysis for KL-Regularized Contextual Bandits and RLHF


699

18

0

07 Nov 2024

Sample-Efficient Alignment for LLMs


315

14

0

03 Nov 2024

Adaptive Segment-level Reward: Bridging the Gap Between Action and Reward Space in Alignment


148

0

0

23 Oct 2024

Optimal Design for Reward Modeling in RLHF


Etienne Boursier

Michael I. Jordan

Michal Valko

558

21

0

22 Oct 2024

Accelerated Preference Optimization for Large Language Model Alignment


248

6

0

08 Oct 2024

DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback

International Conference on Learning Representations (ICLR), 2024

Efstathia Soufleri

Debajoy Mukherjee

Srinivas Shakkottai

375

2

0

07 Oct 2024

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

International Conference on Learning Representations (ICLR), 2024

Jonathan D. Chang

Kianté Brantley

515

21

0

06 Oct 2024

Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment


537

3

0

03 Oct 2024

FedPT: Federated Proxy-Tuning of Large Language Models on Resource-Constrained Edge Devices

Zhidong Gao

Yu Zhang

Yuanxiong Guo

208

3

0

01 Oct 2024

Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Qi Zhang

Xuanjing Huang

283

12

0

27 Aug 2024

Biased Dueling Bandits with Stochastic Delayed Feedback


453

3

0

26 Aug 2024

Conversational Dueling Bandits in Generalized Linear Models


Hui Yuan

Mengdi Wang

207

4

0

26 Jul 2024

Bandits with Preference Feedback: A Stackelberg Game Perspective


Parnian Kassraie

Andreas Krause

448

6

0

24 Jun 2024

Adversarial Multi-dueling Bandits


Pratik Gajane

242

1

0

18 Jun 2024

Online Bandit Learning with Offline Preference Data for Improved RLHF


Akhil Agnihotri

Deepak Ramachandran

804

4

0

13 Jun 2024

Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis


464

3

0

11 Jun 2024

Self-Play with Adversarial Critic: Provable and Scalable Offline Alignment for Language Models

Xiang Ji

Sanjeev Kulkarni

387

12

0

06 Jun 2024

Active Preference Learning for Ordering Items In- and Out-of-sample

Neural Information Processing Systems (NeurIPS), 2024

Herman Bergström

Emil Carlsson

Devdatt Dubhashi

Fredrik D. Johansson

301

6

0

05 May 2024

Self-Play Preference Optimization for Language Model Alignment


Quanquan Gu

680

229

0

01 May 2024

REBEL: Reinforcement Learning via Regressing Relative Rewards


Jonathan D. Chang

Kianté Brantley

Thorsten Joachims

J. Andrew Bagnell

456

69

0

25 Apr 2024

Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback


513

5

0

16 Apr 2024

Dataset Reset Policy Optimization for RLHF


Jonathan D. Chang

Kianté Brantley

Dipendra Kumar Misra

542

36

0

12 Apr 2024

Feel-Good Thompson Sampling for Contextual Dueling Bandits


Quanquan Gu

260

17

0

09 Apr 2024

Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences

Michael Santacroce

Ahmed Hassan Awadallah

573

171

0

04 Apr 2024

DP-Dueling: Learning from Preference Feedback without Compromising User Privacy

347

1

0

22 Mar 2024

Provable Multi-Party Reinforcement Learning with Diverse Human Feedback


Weijie J. Su

Zhiwei Steven Wu

268

27

0

08 Mar 2024

Reinforcement Learning from Human Feedback with Active Queries


Quanquan Gu

524

38

0

14 Feb 2024

Online Iterative Reinforcement Learning from Human Feedback with General Preference Model

Neural Information Processing Systems (NeurIPS), 2024

Wei Xiong

Tong Zhang

337

34

0

11 Feb 2024

Principled Preferential Bayesian Optimization


B. Svetozarevic

331

14

0

08 Feb 2024

Efficient Exploration for LLMs


Vikranth Dwaracherla

Benjamin Van Roy

516

43

0

01 Feb 2024

A Minimaximalist Approach to Reinforcement Learning from Human Feedback

International Conference on Machine Learning (ICML), 2024

Zhiwei Steven Wu

642

144

0

08 Jan 2024

Think Before You Duel: Understanding Complexities of Preference Learning under Constrained Resources

248

0

0

28 Dec 2023

Page 1 of 2