Efficient Exploration for Dialog Policy Learning with Deep BBQ Networks \& Replay Buffer Spiking

17 August 2016

Zachary Chase Lipton

Li Deng

Abstract

When rewards are sparse and efficient exploration essential, deep Q-learning with $\epsilon$ -greedy exploration tends to fail. This poses problems for otherwise promising domains such as task-oriented dialog systems, where the primary reward signal, indicating successful completion, typically occurs only at the end of each episode but depends on the entire sequence of utterances. A poor agent encounters such successful dialogs rarely, and a random agent may never stumble upon a successful outcome in reasonable time. We present two techniques that significantly improve the efficiency of exploration for deep Q-learning agents in dialog systems. First, we demonstrate that exploration by Thompson sampling, using Monte Carlo samples from a Bayes-by-Backprop neural network, yields marked improvement over standard DQNs with Boltzmann or $\epsilon$ -greedy exploration. Second, we show that spiking the replay buffer with a small number of successes, as are easy to harvest for dialog tasks, can make Q-learning feasible when it might otherwise fail catastrophically.

View on arXiv

Comments on this paper