Bi-directional Recurrence Improves Transformer in Partially Observable Markov Decision Processes

16 May 2025

Abstract

In real-world reinforcement learning (RL) scenarios, agents often encounter partial observability, where incomplete or noisy information obscures the true state of the environment. Partially Observable Markov Decision Processes (POMDPs) are commonly used to model these environments, but effective performance requires memory mechanisms to utilise past observations. While recurrence networks have traditionally addressed this need, transformer-based models have recently shown improved sample efficiency in RL tasks. However, their application to POMDPs remains underdeveloped, and their real-world deployment is constrained due to the high parameter count. This work introduces a novel bi-recurrent model architecture that improves sample efficiency and reduces model parameter count in POMDP scenarios. The architecture replaces the multiple feed forward layers with a single layer of bi-directional recurrence unit to better capture and utilize sequential dependencies and contextual information. This approach improves the model's ability to handle partial observability and increases sample efficiency, enabling effective learning from comparatively fewer interactions. To evaluate the performance of the proposed model architecture, experiments were conducted on a total of 23 POMDP environments. The proposed model architecture outperforms existing transformer-based, attention-based, and recurrence-based methods by a margin ranging from 87.39% to 482.04% on average across the 23 POMDP environments.

View on arXiv

@article{arora2025_2505.11153,
  title={ Bi-directional Recurrence Improves Transformer in Partially Observable Markov Decision Processes },
  author={ Ashok Arora and Neetesh Kumar },
  journal={arXiv preprint arXiv:2505.11153},
  year={ 2025 }
}

Comments on this paper