
Computational Hardness of Reinforcement Learning with Partial $q^\pi$-Realizability

Main: 9 pages · Appendix: 13 pages · Bibliography: 2 pages · 1 figure · 4 tables
Abstract

This paper investigates the computational complexity of reinforcement learning in a novel linear function approximation regime, termed partial $q^{\pi}$-realizability. In this framework, the objective is to learn an $\epsilon$-optimal policy with respect to a predefined policy set $\Pi$, under the assumption that the value functions of all policies in $\Pi$ are linearly realizable. The assumptions of this framework are weaker than those of $q^{\pi}$-realizability but stronger than those of $q^{*}$-realizability, providing a practical model in which function approximation naturally arises. We prove that learning an $\epsilon$-optimal policy in this setting is computationally hard. Specifically, we establish NP-hardness under a parameterized greedy (argmax) policy set and show that, unless NP = RP, an exponential lower bound (in the feature dimension) holds when the policy set contains softmax policies, under the Randomized Exponential Time Hypothesis. Our hardness results mirror those in $q^{*}$-realizability and suggest that the computational difficulty persists even when $\Pi$ is expanded beyond the optimal policy. To establish this, we reduce two complexity problems, $\delta$-Max-3SAT and $\delta$-Max-3SAT(b), to instances of GLinear-$\kappa$-RL (greedy policy) and SLinear-$\kappa$-RL (softmax policy). Our findings indicate that positive computational results are generally unattainable under partial $q^{\pi}$-realizability, in contrast to $q^{\pi}$-realizability under a generative access model.
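As a rough sketch of the setting (the feature map $\varphi$, parameter vectors $\theta_{\pi}$, dimension $d$, and initial state $s_0$ are assumed notation, not necessarily the paper's exact definitions), partial $q^{\pi}$-realizability asks that the action-value function of every policy in the given set $\Pi$ be linear in known features, and the learner is judged only against the best policy in $\Pi$:

$$
\forall\, \pi \in \Pi,\ \forall\, (s,a):\quad q^{\pi}(s,a) = \langle \varphi(s,a),\, \theta_{\pi} \rangle, \qquad \varphi(s,a),\, \theta_{\pi} \in \mathbb{R}^{d},
$$

with an $\epsilon$-optimal output $\hat{\pi}$ required to satisfy $v^{\hat{\pi}}(s_0) \ge \max_{\pi \in \Pi} v^{\pi}(s_0) - \epsilon$. Informally, taking $\Pi$ to be the set of all policies recovers the stronger $q^{\pi}$-realizability assumption, while shrinking $\Pi$ toward an optimal policy approaches the weaker $q^{*}$-realizability assumption, which is how the framework sits between the two.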
