Adversarial Policy Optimization for Offline Preference-based Reinforcement LearningInternational Conference on Learning Representations (ICLR), 2025 |
DOPL: Direct Online Preference Learning for Restless Bandits with Preference FeedbackInternational Conference on Learning Representations (ICLR), 2024 |
Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward InferenceInternational Conference on Learning Representations (ICLR), 2024 |