Relative Entropy Pathwise Policy Optimization

15 July 2025
C. Voelcker, Axel Brunnbauer, Marcel Hussing, Michal Nauman, Pieter Abbeel, Eric Eaton, Radu Grosu, Amir-massoud Farahmand, Igor Gilitschenski
arXiv:2507.11019 (abs · PDF · HTML) · GitHub (30★)
Main: 13 pages · Appendix: 6 pages · Bibliography: 8 pages · 14 figures · 2 tables
Abstract

Score-function-based methods for policy learning, such as REINFORCE and PPO, have delivered strong results in game-playing and robotics, yet their high variance often undermines training stability. Pathwise policy gradients, which compute the derivative by differentiating the objective function directly through a learned value model, alleviate these variance issues. However, they require an accurate action-conditioned value function, which is notoriously hard to learn without relying on replay buffers for reusing past off-policy data. We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to combine stochastic policies for exploration with constrained updates for stable training, and we evaluate important architectural components that stabilize value function learning. The result, Relative Entropy Pathwise Policy Optimization (REPPO), is an efficient on-policy algorithm that combines the stability of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning. Compared to state-of-the-art methods on two standard GPU-parallelized benchmarks, REPPO delivers strong empirical performance with superior sample efficiency, wall-clock time, memory footprint, and hyperparameter robustness.
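
The abstract describes the core update at a high level: actions are reparameterized so that gradients can flow from a learned action-conditioned Q-value model back into the policy, while a relative-entropy (KL) constraint keeps each update close to the previous policy. The PyTorch sketch below illustrates that general idea only; the network sizes, the KL penalty coefficient, and every other detail here are assumptions made for illustration and do not reproduce the authors' REPPO implementation (see the linked GitHub repository for that).

```python
# Simplified, hypothetical sketch of a pathwise policy update with a
# relative-entropy (KL) penalty. Not the authors' REPPO code.
import torch
import torch.nn as nn

obs_dim, act_dim, kl_coef = 8, 2, 0.1  # illustrative sizes and penalty weight

# Action-conditioned value model Q(s, a), assumed trained on on-policy data.
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))

# Gaussian policy: a mean network plus a learned log-standard-deviation.
policy_mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
log_std = nn.Parameter(torch.zeros(act_dim))
optimizer = torch.optim.Adam(list(policy_mean.parameters()) + [log_std], lr=3e-4)

def policy_dist(obs):
    return torch.distributions.Normal(policy_mean(obs), log_std.exp())

obs = torch.randn(256, obs_dim)      # a batch of on-policy observations
with torch.no_grad():                # snapshot of the policy before this update
    old_dist = policy_dist(obs)

dist = policy_dist(obs)
actions = dist.rsample()             # reparameterized sample: gradients flow through it
q_values = q_net(torch.cat([obs, actions], dim=-1)).squeeze(-1)

# Pathwise objective: maximize Q while keeping KL(new || old) small.
kl = torch.distributions.kl_divergence(dist, old_dist).sum(-1)
loss = -(q_values - kl_coef * kl).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the action is drawn with rsample(), the gradient of Q with respect to the policy parameters is computed by backpropagation through the value model rather than via the score function, which is what keeps the variance low; the KL term is a stand-in for the constrained update the abstract mentions.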
