Distributional reinforcement learning (DRL) extends the value-based approach by approximating the full distribution over future returns instead of only its mean, providing a richer signal that leads to improved performance. Quantile regression (QR) based methods such as QR-DQN project arbitrary distributions onto a parametric subset of staircase distributions by minimizing the 1-Wasserstein distance. However, due to the biased gradients of the Wasserstein loss, the quantile regression loss is used instead for training; it guarantees the same minimizer while enjoying unbiased gradients. Non-crossing constraints on the quantiles have been shown to improve the performance of QR-DQN for uncertainty-based exploration strategies. The contribution of this work lies in the setting of fixed quantile levels and is twofold. First, we prove that the Cram\'er distance yields the same projection as the 1-Wasserstein distance and that, under non-crossing constraints, the squared Cram\'er and quantile regression losses yield collinear gradients, shedding light on the connection between these central elements of DRL. Second, we propose a low-complexity algorithm to compute the Cram\'er distance.
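For illustration, the following is a minimal NumPy sketch of one way to evaluate the squared Cram\'er distance between two equally weighted staircase distributions, by integrating the squared difference of their CDFs over the merged atom locations. It assumes an $N$-atom, uniform-weight parameterization and a hypothetical function name `cramer_sq`; it is a baseline sketch, not the low-complexity algorithm proposed in this work.

```python
import numpy as np

def cramer_sq(theta, psi):
    """Squared Cramer distance between two N-atom staircase distributions
    with equal weights 1/N at locations `theta` and `psi` (1-D arrays).

    Integrates (F - G)^2 over the real line; the CDF difference is
    piecewise constant between the merged, sorted atom locations."""
    theta = np.asarray(theta, dtype=float)
    psi = np.asarray(psi, dtype=float)
    n = len(theta)
    # Merge breakpoints; each atom of F adds +1/n to F - G, each atom of G adds -1/n.
    locs = np.concatenate([theta, psi])
    jumps = np.concatenate([np.full(n, 1.0 / n), np.full(n, -1.0 / n)])
    order = np.argsort(locs, kind="mergesort")
    locs, jumps = locs[order], jumps[order]
    diff = np.cumsum(jumps)    # F(x) - G(x) on each interval [locs[i], locs[i+1])
    widths = np.diff(locs)     # lengths of those intervals
    return float(np.sum(diff[:-1] ** 2 * widths))
```

As a quick check, `cramer_sq([0.0, 2.0], [1.0, 3.0])` returns 0.5, matching the direct integral of the squared CDF difference for these two 2-atom distributions.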