Distributional reinforcement learning (DRL) extends the value-based approach by using a deep convolutional network to approximate the full distribution over future returns instead of only its mean, providing a richer signal that leads to improved performance. Quantile-based methods such as QR-DQN project arbitrary distributions onto a parametric subset of staircase distributions by minimizing the 1-Wasserstein distance; however, due to biased gradients, the quantile regression loss is used instead for training, which guarantees the same minimizer and enjoys unbiased gradients. Recently, monotonicity constraints on the quantiles have been shown to improve the performance of QR-DQN for uncertainty-based exploration strategies. The contribution of this work is in the setting of fixed quantile levels and is twofold. First, we prove that the Cram\'er distance yields a projection that coincides with the 1-Wasserstein one and that, under monotonicity constraints, the squared Cram\'er and the quantile regression losses yield collinear gradients, shedding light on the connection between these important elements of DRL. Second, we propose a novel non-crossing neural architecture that allows good training performance, together with a novel algorithm to compute the Cram\'er distance, yielding significant improvements over QR-DQN on a number of games of the standard Atari 2600 benchmark.
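To make the quantities mentioned above concrete, the following is a minimal NumPy sketch of the standard quantile regression loss and of the squared Cram\'er ($\ell_2$) distance between two staircase distributions, computed by naive direct integration of the squared CDF difference. The function names and the direct-integration approach are illustrative assumptions for this sketch; this is not the non-crossing architecture or the Cram\'er-distance algorithm proposed in the paper.

```python
import numpy as np

def cramer2_staircase(theta_p, theta_q):
    """Squared Cramer (l2) distance between two staircase distributions,
    each placing mass 1/N on its support points (naive direct integration
    of the squared CDF difference; illustrative only)."""
    pts = np.sort(np.concatenate([theta_p, theta_q]))
    # CDF of each staircase distribution just to the right of every breakpoint
    Fp = np.searchsorted(np.sort(theta_p), pts, side="right") / len(theta_p)
    Fq = np.searchsorted(np.sort(theta_q), pts, side="right") / len(theta_q)
    widths = np.diff(pts)  # lengths of the intervals between breakpoints
    return float(np.sum((Fp[:-1] - Fq[:-1]) ** 2 * widths))

def quantile_regression_loss(theta, samples, taus):
    """Quantile regression loss used by QR-DQN-style methods:
    rho_tau(u) = u * (tau - 1{u < 0}), averaged over target samples."""
    u = samples[None, :] - theta[:, None]      # pairwise residuals, shape (N, M)
    loss = u * (taus[:, None] - (u < 0.0))
    return float(loss.mean())

# Example: N = 4 fixed quantile levels at the usual midpoints (2i - 1) / (2N)
taus = (2 * np.arange(1, 5) - 1) / 8.0
theta = np.array([-1.0, 0.0, 0.5, 2.0])        # predicted quantile values
samples = np.array([-0.5, 0.3, 1.2])           # target return samples
print(quantile_regression_loss(theta, samples, taus))
print(cramer2_staircase(theta, samples))
```

The quantile regression loss here averages over quantile levels and samples; actual QR-DQN implementations typically use a Huber-smoothed variant and sum over quantile levels, but the minimizer is unchanged up to scaling.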