Machine learning the first stage in 2SLS: Practical guidance from bias decomposition and simulation

19 May 2025

Connor Lennon

Edward Rubin

Glen Waddell


Main: 24 pages; Bibliography: 4 pages; Appendix: 8 pages; 8 figures; 2 tables

Abstract

Machine learning (ML) primarily evolved to solve "prediction problems." The first stage of two-stage least squares (2SLS) is a prediction problem, suggesting potential gains from ML first-stage assistance. However, little guidance exists on when ML helps 2SLS, or when it hurts. We investigate the implications of inserting ML into 2SLS, decomposing the bias into three informative components. Mechanically, ML-in-2SLS procedures face issues common to prediction and causal-inference settings, as well as to their interaction. Through simulation, we show that linear ML methods (e.g., post-Lasso) work well, while nonlinear methods (e.g., random forests, neural nets) generate substantial bias in second-stage estimates, potentially exceeding the bias of endogenous OLS.
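To make the setup concrete, the following is a minimal sketch of the baseline comparison the abstract refers to: endogenous OLS versus 2SLS with a plain linear first stage. It is not the authors' simulation design. The data-generating process, the coefficient values, and the function names (`simulate`, `ols_slope`, `two_sls`) are all assumptions chosen for illustration, and it uses a single instrument and only the Python standard library.

```python
import random


def simulate(n=5000, beta=1.0, seed=0):
    """Draw (z, x, y) from a hypothetical DGP in which x is endogenous.

    The unobserved confounder u enters both x and y, so OLS of y on x is biased.
    The instrument z affects x but is independent of u.
    """
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0.0, 1.0)                    # instrument
        ui = rng.gauss(0.0, 1.0)                    # unobserved confounder
        xi = 0.8 * zi + ui + rng.gauss(0.0, 1.0)    # first-stage relationship
        yi = beta * xi + ui + rng.gauss(0.0, 1.0)   # structural equation
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


def mean(v):
    return sum(v) / len(v)


def ols_slope(a, b):
    """Slope from an OLS regression of b on a (intercept included)."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var = sum((ai - ma) ** 2 for ai in a)
    return cov / var


def two_sls(z, x, y):
    """2SLS with a linear first stage: predict x from z, then regress y on the prediction."""
    pi_hat = ols_slope(z, x)
    mz, mx = mean(z), mean(x)
    x_hat = [mx + pi_hat * (zi - mz) for zi in z]   # first-stage fitted values
    return ols_slope(x_hat, y)


if __name__ == "__main__":
    z, x, y = simulate()
    print("true beta:      1.0")
    print("endogenous OLS:", round(ols_slope(x, y), 3))
    print("linear 2SLS:   ", round(two_sls(z, x, y), 3))
```

Under this assumed DGP, the population OLS slope is 1 + 1/2.64 (about 1.38), because the confounder contributes one unit of covariance between x and the error while x has variance 0.64 + 1 + 1. The 2SLS estimate should land near the true value of 1. The first stage is the only place where a different predictor could be substituted for the linear fit in `two_sls`, which is the substitution the paper studies.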
