Automatic Double Reinforcement Learning in Semiparametric Markov Decision Processes with Applications to Long-Term Causal Inference

Estimating long-term causal effects from short-term data is essential for decision-making in healthcare, economics, and industry, where long-term follow-up is often infeasible. Markov Decision Processes (MDPs) offer a principled framework for modeling outcomes as sequences of states, actions, and rewards over time. We introduce a semiparametric extension of Double Reinforcement Learning (DRL) for statistically efficient, model-robust inference on linear functionals of the Q-function, such as policy values, in infinite-horizon, time-homogeneous MDPs. By imposing semiparametric structure on the Q-function, our method relaxes the strong state overlap assumptions required by fully nonparametric approaches, improving efficiency and stability. To address computational and robustness challenges of minimax nuisance estimation, we develop a novel debiased plug-in estimator based on isotonic Bellman calibration, which integrates fitted Q-iteration with an isotonic regression step. This procedure leverages the Q-function as a data-driven dimension reduction, debiases all linear functionals of interest simultaneously, and enables nonparametric inference without explicit nuisance function estimation. Bellman calibration generalizes isotonic calibration to MDPs and may be of independent interest for prediction in reinforcement learning. Finally, we show that model selection for the Q-function incurs only second-order bias and extend the adaptive debiased machine learning (ADML) framework to MDPs for data-driven learning of semiparametric structure.
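To make the estimand and the calibration step concrete, the policy value $V(\pi^e) = \mathbb{E}_{s_0 \sim \nu,\, a_0 \sim \pi^e(\cdot \mid s_0)}\big[Q^{\pi^e}(s_0, a_0)\big]$, with $\nu$ the initial-state distribution, is one example of a linear functional of the Q-function. The sketch below is a minimal, hypothetical illustration of how a Bellman-calibrated plug-in estimator of this functional might be assembled: fitted Q-iteration for policy evaluation, followed by an isotonic regression of Bellman targets on the Q-function's own predictions, used as a one-dimensional data-driven reduction. The function names, the choice of gradient boosting as the Q-learner, and the stopping rules are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (assumed details), not the paper's algorithm.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.isotonic import IsotonicRegression


def fitted_q_evaluation(s, a, r, s_next, pi_next_a, gamma, n_iter=50):
    """Fitted Q-iteration for evaluating a fixed target policy:
    repeatedly regress Bellman targets r + gamma * Q(s', pi(s')) on (s, a)."""
    X = np.column_stack([s, a])
    X_next = np.column_stack([s_next, pi_next_a])   # next state, evaluation-policy action
    q_next = np.zeros(len(r))
    model = None
    for _ in range(n_iter):
        targets = r + gamma * q_next                # Bellman targets under current Q estimate
        model = GradientBoostingRegressor().fit(X, targets)
        q_next = model.predict(X_next)
    return model


def isotonic_bellman_calibration(model, s, a, r, s_next, pi_next_a, gamma, n_iter=20):
    """Isotonic Bellman calibration (assumed form): fit a monotone map from the
    uncalibrated Q predictions to Bellman targets built from calibrated next-step values."""
    q_pred = model.predict(np.column_stack([s, a]))
    q_pred_next = model.predict(np.column_stack([s_next, pi_next_a]))
    iso = IsotonicRegression(out_of_bounds="clip")
    q_cal_next = q_pred_next.copy()                 # start from the uncalibrated predictions
    for _ in range(n_iter):
        targets = r + gamma * q_cal_next
        iso.fit(q_pred, targets)                    # isotonic regression along Q's predictions
        q_cal_next = iso.predict(q_pred_next)
    return iso


def plug_in_policy_value(iso, model, s0, pi0_a):
    """Plug-in estimate of the policy value: average the calibrated Q over
    initial states and evaluation-policy actions."""
    q0 = model.predict(np.column_stack([s0, pi0_a]))
    return iso.predict(q0).mean()
```

Because the calibration acts only on the scalar Q predictions, the same calibrated fit can be reused to form plug-in estimates of any linear functional of the Q-function, which is the sense in which all functionals of interest are debiased simultaneously in the abstract above; again, the exact construction in the paper may differ from this sketch.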
@article{laan2025_2501.06926,
  title   = {Automatic Double Reinforcement Learning in Semiparametric Markov Decision Processes with Applications to Long-Term Causal Inference},
  author  = {Lars van der Laan and David Hubbard and Allen Tran and Nathan Kallus and Aurélien Bibaut},
  journal = {arXiv preprint arXiv:2501.06926},
  year    = {2025}
}