arXiv:2302.04374

Near-Optimal Adversarial Reinforcement Learning with Switching Costs

8 February 2023
Ming Shi
Yitao Liang
Ness B. Shroff
Abstract

Switching costs, which capture the costs of changing policies, are regarded as a critical metric in reinforcement learning (RL), in addition to the standard metric of losses (or rewards). However, existing studies on switching costs (with a coefficient $\beta$ that is strictly positive and independent of $T$) have mainly focused on static RL, where the loss distribution is assumed to be fixed during the learning process, and thus do not consider practical scenarios where the loss distribution could be non-stationary or even adversarial. While adversarial RL better models such scenarios, an open problem remains: how to develop a provably efficient algorithm for adversarial RL with switching costs? This paper makes the first effort towards solving this problem. First, we provide a regret lower bound showing that the regret of any algorithm must be larger than $\tilde{\Omega}\big((HSA)^{1/3} T^{2/3}\big)$, where $T$, $S$, $A$ and $H$ are the number of episodes, states, actions and layers in each episode, respectively. Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best achieved regret (whose dependency on $T$ is $\tilde{O}(\sqrt{T})$) in static RL with switching costs (as well as in adversarial RL without switching costs) is no longer achievable. Moreover, we propose two novel switching-reduced algorithms whose regrets match our lower bound when the transition function is known, and match it within a small factor of $\tilde{O}(H^{1/3})$ when the transition function is unknown. Our regret analysis demonstrates the near-optimal performance of both algorithms.
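
To give a feel for the gap between the two rates mentioned in the abstract, the following is a minimal numerical sketch, not taken from the paper. It evaluates $(HSA)^{1/3} T^{2/3}$ and $\sqrt{T}$ with all constants and logarithmic factors dropped. The function names and the example values of $H$, $S$, $A$ are hypothetical and chosen only for illustration.

```python
import math


def lower_bound_rate(H: int, S: int, A: int, T: int) -> float:
    """Return (H*S*A)^(1/3) * T^(2/3), ignoring constants and log factors."""
    return (H * S * A) ** (1 / 3) * T ** (2 / 3)


def sqrt_rate(T: int) -> float:
    """Return sqrt(T), the T-dependence quoted for the static setting."""
    return math.sqrt(T)


def gap_factor(T: int) -> float:
    """Return T^(2/3) / T^(1/2), which equals T^(1/6)."""
    return T ** (2 / 3) / math.sqrt(T)


if __name__ == "__main__":
    # Hypothetical problem sizes, not values from the paper.
    H, S, A = 5, 10, 4
    for T in (10**3, 10**6, 10**9):
        print(
            f"T={T:>10}  "
            f"(HSA)^(1/3) T^(2/3)={lower_bound_rate(H, S, A, T):.1f}  "
            f"sqrt(T)={sqrt_rate(T):.1f}  "
            f"T^(1/6)={gap_factor(T):.2f}"
        )
```

The ratio between the two rates grows as $T^{1/6}$, so the separation is polynomial in the number of episodes rather than a constant or logarithmic factor.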
