Multi-armed bandits are a fundamental problem in sequential decision theory, with current applications in artificial intelligence and online services. For continuum-armed and tree-armed bandits, we describe an algorithm that attains near-optimal regret rates without knowledge of the reward distributions. For tree-armed bandits, our algorithm can operate on infinite trees and adaptively combine multiple trees so as to minimise the regret. Applying this algorithm to continuum-armed bandits, we obtain square-root regret, without prior information, whenever the mean function satisfies a condition we call zooming continuity, which holds in some generality.
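For concreteness, a minimal sketch of what "square-root regret" typically means, written in standard bandit notation (the mean function $f$, played arms $x_t$, and horizon $n$ below are the usual conventions, assumed here; the paper's precise definitions of regret and of zooming continuity may differ):

```latex
% Illustrative formalisation only, using standard bandit notation;
% the paper's exact definitions may differ.
% Cumulative regret after n rounds against the best arm:
\[
  R_n \;=\; n \, \sup_{x} f(x) \;-\; \sum_{t=1}^{n} f(x_t),
  \qquad
  \mathbb{E}[R_n] \;=\; \tilde{O}\!\left(\sqrt{n}\right),
\]
% where \tilde{O} hides polylogarithmic factors in n. "Square-root
% regret" refers to this \sqrt{n} growth of the expected regret,
% which is the minimax rate for many bandit problems.
```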