
Tempo Adaptation in Non-stationary Reinforcement Learning

Abstract

We first raise and tackle a ``time synchronization'' issue between the agent and the environment in non-stationary reinforcement learning (RL), a crucial factor hindering its real-world applications. In reality, environmental changes occur over wall-clock time ($t$) rather than episode progress ($k$), where wall-clock time signifies the actual elapsed time within the fixed duration $t \in [0, T]$. In existing works, at episode $k$, the agent rolls out a trajectory and trains a policy before transitioning to episode $k+1$. In the context of the time-desynchronized environment, however, the agent at time $t_k$ allocates $\Delta t$ for trajectory generation and training, and subsequently moves to the next episode at $t_{k+1} = t_k + \Delta t$. Despite a fixed total number of episodes ($K$), the agent accumulates different trajectories influenced by the choice of interaction times ($t_1, t_2, \ldots, t_K$), which significantly impacts the suboptimality gap of the policy. We propose a Proactively Synchronizing Tempo (\texttt{ProST}) framework that computes a suboptimal sequence $\{t_1, t_2, \ldots, t_K\}$ ($= \{t_{1:K}\}$) by minimizing an upper bound on its performance measure, i.e., the dynamic regret. Our main contribution is to show that a suboptimal $\{t_{1:K}\}$ trades off the policy training time (agent tempo) against how fast the environment changes (environment tempo). Theoretically, this work develops a suboptimal $\{t_{1:K}\}$ as a function of the degree of the environment's non-stationarity while also achieving a sublinear dynamic regret. Our experimental evaluation on various high-dimensional non-stationary environments shows that the \texttt{ProST} framework achieves a higher online return at suboptimal $\{t_{1:K}\}$ than the existing methods.
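The sketch below illustrates the time-desynchronized episode loop described in the abstract; it is not the paper's implementation. The `Agent` and `Env` classes are hypothetical stubs, and the key point is that the environment is indexed by wall-clock time $t$ rather than the episode counter $k$, so the chosen schedule $\{t_{1:K}\}$ determines what data the agent collects.

```python
# Minimal sketch (assumed interfaces, not the ProST implementation) of the
# time-desynchronized interaction loop: the environment evolves over
# wall-clock time t, and training time Delta_t delays the next interaction.

class Env:
    def rollout(self, at_time, policy):
        # Placeholder: dynamics and rewards would depend on wall-clock time.
        return {"time": at_time, "transitions": []}

class Agent:
    policy = None

    def train_for(self, traj, wall_clock_budget):
        # Placeholder: spend `wall_clock_budget` units of training on `traj`.
        pass

def run_desynchronized(agent, env, training_budgets, t0=0.0):
    """training_budgets[k] is the Delta_t allocated at episode k, so the
    resulting schedule satisfies t_{k+1} = t_k + Delta_t."""
    t, schedule = t0, []
    for delta_t in training_budgets:
        schedule.append(t)
        # The trajectory is generated from the environment as it exists at time t.
        traj = env.rollout(at_time=t, policy=agent.policy)
        # A larger Delta_t means more training (agent tempo) but more
        # environment drift before the next interaction (environment tempo),
        # the trade-off that the choice of {t_1:K} is meant to balance.
        agent.train_for(traj, wall_clock_budget=delta_t)
        t += delta_t
    return schedule

schedule = run_desynchronized(Agent(), Env(), training_budgets=[1.0, 0.5, 2.0])
```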
