Failing to Explore: Language Models on Interactive Tasks

29 January 2026

Mahdi JafariRaviz

Keivan Rezaei

Arshia Soltani Moakhar

Zahra Sodagar

Yize Cheng

Soheil Feizi



Main: 7 Pages

18 Figures

Bibliography: 5 Pages

18 Tables

Appendix: 21 Pages

Abstract

We evaluate language models on their ability to explore interactive environments under a limited interaction budget. We introduce three parametric tasks with controllable exploration difficulty, spanning continuous and discrete environments. Across state-of-the-art models, we find systematic under-exploration and suboptimal solutions, with performance often significantly worse than simple explore-exploit heuristic baselines and scaling weakly as the budget increases. Finally, we study two lightweight interventions: splitting a fixed budget into parallel executions, which surprisingly improves performance despite a no-gain theoretical result for our tasks, and periodically summarizing the interaction history, which preserves key discoveries and further improves exploration.
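To make "explore-exploit heuristic baseline" and "splitting a fixed budget into parallel executions" concrete, the following is a minimal illustrative sketch and not the paper's tasks or baselines. It assumes a hypothetical one-dimensional environment exposed as a function `f` to be maximized on an interval, where each call to `f` costs one unit of budget. The names `explore_then_exploit`, `split_budget`, and `explore_frac` are assumptions introduced here for illustration.

```python
import random


def explore_then_exploit(f, lo, hi, budget, explore_frac=0.5, seed=0):
    """Maximize f on [lo, hi] using exactly `budget` queries (illustrative only)."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    rng = random.Random(seed)
    n_explore = max(1, int(budget * explore_frac))
    best_x, best_y = None, float("-inf")

    # Explore: uniform random queries over the whole domain.
    for _ in range(n_explore):
        x = rng.uniform(lo, hi)
        y = f(x)
        if y > best_y:
            best_x, best_y = x, y

    # Exploit: perturb the best point found so far, shrinking the
    # search radius whenever a query fails to improve on it.
    radius = (hi - lo) / 10
    for _ in range(budget - n_explore):
        x = min(hi, max(lo, best_x + rng.uniform(-radius, radius)))
        y = f(x)
        if y > best_y:
            best_x, best_y = x, y
        else:
            radius *= 0.9
    return best_x, best_y


def split_budget(f, lo, hi, budget, n_runs):
    """Split one budget across n_runs independent runs and keep the best result."""
    per_run = budget // n_runs  # any remainder is left unused in this sketch
    results = [
        explore_then_exploit(f, lo, hi, per_run, seed=s) for s in range(n_runs)
    ]
    return max(results, key=lambda r: r[1])
```

A call such as `split_budget(f, 0.0, 1.0, budget=40, n_runs=4)` runs four independent searches of ten queries each and returns the best `(x, f(x))` pair. Whether splitting helps or hurts depends on the environment; the sketch only shows the mechanics of the two ideas.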
