Dynamics of Softmax Q-Learning in Two-Player Two-Action Games
We consider the dynamics of Q-learning in two-player two-action games with a Boltzmann exploration mechanism. For any non-zero exploration rate the dynamics are dissipative, which guarantees that agent strategies converge to rest points that are generally different from the game's Nash Equilibria (NE). We provide a comprehensive characterization of the rest point structure for different games, and examine the sensitivity of this structure to the noise induced by exploration. Our results indicate that for a class of games with multiple NE, the asymptotic behavior of the learning dynamics can undergo drastic changes at a critical exploration rate. A somewhat counterintuitive manifestation of this behavior is that increasing the noise might lead the agents to select a better solution.
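The setting described above can be sketched in a minimal simulation: two stateless Q-learners repeatedly play a 2x2 game, each sampling actions from a Boltzmann (softmax) policy over its Q-values. The payoff matrix, learning rate, and inverse temperature below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(q, beta):
    # Boltzmann policy with inverse temperature beta
    # (higher beta = less exploration)
    z = np.exp(beta * (q - q.max()))
    return z / z.sum()

# Hypothetical symmetric coordination game with two pure NE,
# at (0,0) and (1,1); the (0,0) equilibrium is payoff-dominant.
payoff = np.array([[4.0, 0.0],
                   [0.0, 2.0]])

def run(beta, alpha=0.1, steps=20000, seed=0):
    rng = np.random.default_rng(seed)
    qa = np.zeros(2)  # Q-values of agent A
    qb = np.zeros(2)  # Q-values of agent B
    for _ in range(steps):
        pa, pb = softmax(qa, beta), softmax(qb, beta)
        i = rng.choice(2, p=pa)
        j = rng.choice(2, p=pb)
        # Stateless (bandit-style) Q-learning update for each agent
        qa[i] += alpha * (payoff[i, j] - qa[i])
        qb[j] += alpha * (payoff[j, i] - qb[j])
    return softmax(qa, beta), softmax(qb, beta)

# Compare a low-exploration and a high-exploration regime
for beta in (10.0, 1.0):
    pa, _ = run(beta)
    print(f"beta={beta}: agent A policy = {np.round(pa, 3)}")
```

Varying `beta` (equivalently, the exploration rate) and tracking which rest point the policies settle into is the kind of experiment the abstract's critical-exploration-rate claim refers to.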