Single- vs. Dual-Policy Reinforcement Learning for Dynamic Bike Rebalancing
Bike-sharing systems (BSS) provide a sustainable urban mobility solution, but ensuring their reliability requires effective rebalancing strategies to address stochastic demand and prevent station imbalances. This paper proposes reinforcement learning (RL) algorithms for dynamic rebalancing problem with multiple vehicles, introducing and comparing two RL approaches: Single-policy RL and Dual-policy RL. We formulate this network optimization problem as a Markov Decision Process within a continuous-time framework, allowing vehicles to make independent and cooperative rebalancing decisions without synchronization constraints. In the first approach, a single deep Q-network (DQN) is trained to jointly learn inventory and routing decisions. The second approach decouples node-level inventory decisions from arc-level vehicle routing, enhancing learning efficiency and adaptability. A high-fidelity simulator under the first-arrive-first-serve rule is developed to estimate rewards across diverse demand scenarios influenced by temporal and weather variations. Extensive experiments demonstrate that while the single-policy model is competitive against several benchmarks, the dual-policy model significantly reduces lost demand. These findings provide valuable insights for bike-sharing operators, reinforcing the potential of RL for real-time rebalancing and paving the way for more adaptive and intelligent urban mobility solutions.
View on arXiv