Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization

Abstract

We consider the problem of learning in adversarial Markov decision processes (MDPs) with an oblivious adversary in a full-information setting. The agent interacts with an environment during $T$ episodes, each of which consists of $H$ stages, and each episode is evaluated with respect to a reward function that is revealed only at the end of the episode. We propose an algorithm, called APO-MVP, that achieves a regret bound of order $\tilde{\mathcal{O}}(\mathrm{poly}(H)\sqrt{SAT})$, where $S$ and $A$ are the sizes of the state and action spaces, respectively. This result improves upon the best-known regret bound by a factor of $\sqrt{S}$, bridging the gap between adversarial and stochastic MDPs, and matching the minimax lower bound $\Omega(\sqrt{H^3 SAT})$ as far as the dependencies on $S$, $A$, and $T$ are concerned. The proposed algorithm and analysis completely avoid the typical tool of occupancy measures; instead, the algorithm performs policy optimization based only on dynamic programming and on a black-box online linear optimization strategy run over estimated advantage functions, making it easy to implement. The analysis leverages two recent techniques: policy optimization based on online linear optimization strategies (Jonckheere et al., 2023) and a refined martingale analysis of the impact of estimating transition kernels on values (Zhang et al., 2023).
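For intuition, here is a minimal Python/numpy sketch of the policy-optimization loop described in the abstract, using Hedge (exponential weights) as one concrete instantiation of the black-box online linear optimization strategy. The hooks `collect_episode` and `estimate_model` are hypothetical placeholders, not part of the paper, and the sketch omits the refinements of the actual APO-MVP algorithm and analysis.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over the last (action) axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def apo_mvp_sketch(S, A, H, T, lr, collect_episode, estimate_model):
    """Hypothetical sketch: per episode, (1) play the current softmax policy,
    (2) run backward dynamic programming under estimated transitions once the
    reward function is revealed, (3) feed estimated advantages as gains to a
    per-(stage, state) Hedge learner."""
    logits = np.zeros((H, S, A))  # one Hedge learner per (stage, state)
    for _ in range(T):
        policy = softmax(logits)                      # pi_t(a | h, s)
        trajectory, reward = collect_episode(policy)  # reward[h]: (S, A), revealed at episode end
        P_hat = estimate_model(trajectory)            # estimated kernels, shape (H, S, A, S)

        # Backward dynamic programming for Q^{pi_t} under the estimated model.
        V = np.zeros(S)
        for h in reversed(range(H)):
            Q = reward[h] + P_hat[h] @ V              # (S, A): r + sum_{s'} P_hat(s'|s,a) V(s')
            V = (policy[h] * Q).sum(axis=-1)          # (S,): value of pi_t at stage h
            # Hedge step on the estimated advantages Q - V.
            logits[h] = logits[h] + lr * (Q - V[:, None])
    return softmax(logits)
```

Running one online-learning instance per (stage, state) pair over estimated advantages is what lets the approach rely only on dynamic programming, sidestepping occupancy measures entirely.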

@article{tiapkin2025_2407.05704,
  title={Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization},
  author={Daniil Tiapkin and Evgenii Chzhen and Gilles Stoltz},
  journal={arXiv preprint arXiv:2407.05704},
  year={2025}
}