This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a $\gamma$-discounted infinite-horizon Markov game with $S$ states, where the max-player has $A$ actions and the min-player has $B$ actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds -- called VI-LCB-Game -- that provably finds an $\varepsilon$-approximate Nash equilibrium with a sample complexity no larger than $\frac{C_{\mathsf{clipped}}^{\star} S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C_{\mathsf{clipped}}^{\star}$ is some unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-\`a-vis the target data), and the target accuracy $\varepsilon$ can be any value within $\big(0, \frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A,B\}$, achieving minimax optimality for the entire $\varepsilon$-range. An appealing feature of our result is its algorithmic simplicity, which shows that neither variance reduction nor sample splitting is needed to achieve sample optimality.
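For reference, a compact restatement of the claimed guarantee (as we read it from the abstract, with logarithmic factors suppressed; $S$, $A$, $B$, $\gamma$, $\varepsilon$, and $C_{\mathsf{clipped}}^{\star}$ are as defined above):
\[
\text{sample complexity} \;=\; \widetilde{O}\!\left( \frac{C_{\mathsf{clipped}}^{\star}\, S\,(A+B)}{(1-\gamma)^{3}\,\varepsilon^{2}} \right),
\qquad \varepsilon \in \Big(0, \tfrac{1}{1-\gamma}\Big].
\]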