\textit{SQT} -- \textit{std} $Q$-target

3 February 2024

Nitsan Soffair

Abstract

\textit{Std} $Q$-target (\textit{SQT}) is a \textit{conservative}, ensemble-based, actor-critic $Q$-learning algorithm built on a single key $Q$-formula: the standard deviation of the $Q$-networks, which serves as an "uncertainty penalty" and offers a minimalistic solution to the problem of \textit{overestimation} bias. We implement \textit{SQT} on top of the TD3/TD7 code and test it against the state-of-the-art (SOTA) actor-critic algorithms DDPG, TD3, and TD7 on seven popular MuJoCo and Bullet tasks. Our results demonstrate the superiority of \textit{SQT}'s $Q$-target formula over \textit{TD3}'s $Q$-target formula as a \textit{conservative} solution to overestimation bias in RL, and \textit{SQT} shows a clear performance advantage, by a wide margin, over DDPG, TD3, and TD7 on all tasks.
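
To make the idea of an "uncertainty penalty" concrete, the following is a minimal sketch of one way a standard-deviation penalty could enter a $Q$-target. It is not taken from the paper: the abstract does not give the exact formula, so the sketch assumes the penalty is subtracted from a TD3-style minimum over the ensemble, and the function name `sqt_target` and the coefficient `alpha` are hypothetical.

```python
from statistics import pstdev


def sqt_target(reward, next_q_values, done, gamma=0.99, alpha=0.5):
    """Illustrative std-penalized TD target for a single transition.

    Assumptions (not specified in the abstract):
      * next_q_values holds the target-critic estimates Q_i(s', a')
        from an ensemble of two or more Q-networks;
      * the penalty is alpha * std_i Q_i(s', a'), subtracted from the
        TD3-style minimum over the ensemble;
      * alpha is a hypothetical penalty coefficient.
    """
    # TD3-style pessimistic estimate: minimum over the ensemble.
    q_min = min(next_q_values)
    # Ensemble disagreement (population std) as the uncertainty penalty.
    q_std = pstdev(next_q_values)
    # Penalized bootstrap value; no bootstrapping on terminal transitions.
    bootstrap = q_min - alpha * q_std
    return reward + gamma * (1.0 - done) * bootstrap
```

Under these assumptions, the target coincides with the TD3-style target whenever the $Q$-networks agree and becomes more conservative as their disagreement grows.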
