In this paper, we propose AsyncQVI, an asynchronous-parallel Q-value iteration algorithm for discounted Markov decision processes whose transitions and rewards can only be sampled through a generative model. Given such a problem with $|\mathcal{S}|$ states, $|\mathcal{A}|$ actions, and a discount factor $\gamma \in (0,1)$, AsyncQVI uses memory of size $\mathcal{O}(|\mathcal{S}|)$ and returns an $\varepsilon$-optimal policy with probability at least $1-\delta$ using $\tilde{\mathcal{O}}\big(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^2}\log(\frac{1}{\delta})\big)$ samples. AsyncQVI is also the first asynchronous-parallel algorithm for discounted Markov decision processes with a sample complexity bound, and the bound nearly matches the theoretical lower bound. The relatively low memory footprint and parallelism make AsyncQVI suitable for large-scale applications. In numerical tests, we compare AsyncQVI with four sample-based value iteration methods. The results show that our algorithm is highly efficient and achieves linear parallel speedup.
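As a rough illustration of the sampled Q-value iteration setting the abstract describes (not the authors' AsyncQVI algorithm itself, which is asynchronous-parallel), here is a minimal single-threaded sketch in Python. The toy MDP, the generative model `sample_step`, and all parameter values are hypothetical, chosen only to make the sketch runnable.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions. P[s, a] is a distribution
# over next states, R[s, a] is the expected reward (all values made up).
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition probabilities
R = rng.uniform(0.0, 1.0, size=(S, A))       # rewards

def sample_step(s, a):
    """Generative model: sample a next state and a reward for (s, a)."""
    s_next = rng.choice(S, p=P[s, a])
    return s_next, R[s, a]

# Sampled Q-value iteration: each sweep replaces the exact expectation
# over next states in the Bellman update with an average over samples
# drawn from the generative model.
Q = np.zeros((S, A))
num_samples = 100    # samples per (s, a) per sweep (arbitrary choice)
for sweep in range(200):
    Q_new = np.empty_like(Q)
    for s in range(S):
        for a in range(A):
            total = 0.0
            for _ in range(num_samples):
                s_next, r = sample_step(s, a)
                total += r + gamma * Q[s_next].max()
            Q_new[s, a] = total / num_samples
    Q = Q_new

policy = Q.argmax(axis=1)   # greedy policy from the estimated Q-values
print("greedy policy:", policy)
```

In the asynchronous-parallel setting of the paper, multiple workers would perform such sampled updates on shared Q-values concurrently, possibly reading stale entries; the sketch above keeps everything sequential for clarity.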