Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP

Abstract

We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results of (Vial et al., 2021). Our first algorithm is computationally efficient and achieves a regret bound $\widetilde{O}\left(\sqrt{d^3B_{\star}^2T_{\star} K}\right)$, where $d$ is the dimension of the feature space, $B_{\star}$ and $T_{\star}$ are upper bounds of the expected costs and hitting time of the optimal policy respectively, and $K$ is the number of episodes. The same algorithm with a slight modification also achieves logarithmic regret of order $O\left(\frac{d^3B_{\star}^4}{c_{\min}^2\text{gap}_{\min}}\ln^5\frac{dB_{\star} K}{c_{\min}} \right)$, where $\text{gap}_{\min}$ is the minimum sub-optimality gap and $c_{\min}$ is the minimum cost over all state-action pairs. Our result is obtained by developing a simpler and improved analysis for the finite-horizon approximation of (Cohen et al., 2021) with a smaller approximation error, which might be of independent interest. On the other hand, using variance-aware confidence sets in a global optimization problem, our second algorithm is computationally inefficient but achieves the first "horizon-free" regret bound $\widetilde{O}(d^{3.5}B_{\star}\sqrt{K})$ with no polynomial dependency on $T_{\star}$ or $1/c_{\min}$, almost matching the $\Omega(dB_{\star}\sqrt{K})$ lower bound from (Min et al., 2021).
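
To get a feel for how the stated rates relate, the following is a minimal sketch that evaluates only the leading terms of the bounds quoted in the abstract. It is not part of the paper: the function names are hypothetical, and all constants and the logarithmic factors hidden by the $\widetilde{O}$ notation are dropped, so the outputs indicate scaling only, not actual regret values.

```python
import math

# Illustrative only: leading terms of the bounds quoted in the abstract.
# Constants and the log factors hidden by O-tilde are dropped (assumption),
# and all function names here are hypothetical, not from the paper.

def efficient_bound(d, b_star, t_star, k):
    """Leading term sqrt(d^3 * B*^2 * T* * K) of the efficient algorithm."""
    return math.sqrt(d**3 * b_star**2 * t_star * k)

def horizon_free_bound(d, b_star, k):
    """Leading term d^3.5 * B* * sqrt(K) of the horizon-free algorithm."""
    return d**3.5 * b_star * math.sqrt(k)

def lower_bound(d, b_star, k):
    """Leading term d * B* * sqrt(K) of the lower bound."""
    return d * b_star * math.sqrt(k)

def log_regret_bound(d, b_star, c_min, gap_min, k):
    """Leading term d^3 B*^4 / (c_min^2 gap_min) * ln^5(d B* K / c_min)."""
    return d**3 * b_star**4 / (c_min**2 * gap_min) * math.log(d * b_star * k / c_min) ** 5

if __name__ == "__main__":
    d, b_star, k = 2, 1.0, 100
    for t_star in (4, 16, 64):
        ratio = efficient_bound(d, b_star, t_star, k) / horizon_free_bound(d, b_star, k)
        print(f"T*={t_star}: efficient / horizon-free = {ratio:.3f}")
```

Under these simplifications the ratio of the first bound to the horizon-free bound is $\sqrt{T_{\star}}/d^2$, so the horizon-free rate is the smaller of the two only when $T_{\star}$ exceeds $d^4$; the gap between the horizon-free rate and the lower bound is a factor of $d^{2.5}$.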
