Distributional reinforcement learning~(RL) is a class of state-of-the-art algorithms that estimate the entire distribution of the total return rather than only its expectation. The empirical success of distributional RL hinges on the representation of return distributions and the choice of distribution divergence. In this paper, we propose a new class of \textit{Sinkhorn distributional RL~(SinkhornDRL)} algorithms that learn a finite set of statistics, i.e., deterministic samples, from each return distribution and then use Sinkhorn iterations to evaluate the Sinkhorn distance between the current and target Bellman distributions. The Sinkhorn divergence interpolates between the Wasserstein distance and Maximum Mean Discrepancy~(MMD). SinkhornDRL thus finds a sweet spot, combining the geometry of optimal-transport-based distances with the unbiased gradient estimates of MMD. Finally, compared to state-of-the-art algorithms, SinkhornDRL's competitive performance is demonstrated on the suite of 55 Atari games.
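To make the distance computation concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of Sinkhorn iterations computing the entropic optimal-transport cost and the debiased Sinkhorn divergence between two sets of scalar return samples; the sample values, the regularization strength \texttt{eps}, and the iteration count are illustrative assumptions.
\begin{verbatim}
import numpy as np

def sinkhorn_cost(x, y, eps=1.0, n_iters=50):
    """Entropic-regularized OT cost between two 1-D sample sets
    with uniform weights, via Sinkhorn fixed-point iterations."""
    a = np.full(len(x), 1.0 / len(x))      # uniform weights on x samples
    b = np.full(len(y), 1.0 / len(y))      # uniform weights on y samples
    C = (x[:, None] - y[None, :]) ** 2     # squared-distance cost matrix
    K = np.exp(-C / eps)                   # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):               # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]        # transport plan
    return np.sum(P * C)

def sinkhorn_divergence(x, y, eps=1.0, n_iters=50):
    """Debiased Sinkhorn divergence: interpolates between the
    Wasserstein distance (eps -> 0) and MMD (eps -> infinity)."""
    return sinkhorn_cost(x, y, eps, n_iters) - 0.5 * (
        sinkhorn_cost(x, x, eps, n_iters)
        + sinkhorn_cost(y, y, eps, n_iters))

# Illustrative current vs. target return samples (made-up numbers)
current = np.array([0.1, 0.4, 0.7, 1.2])
target = np.array([0.2, 0.5, 0.9, 1.5])
print(sinkhorn_divergence(current, target, eps=0.5))
\end{verbatim}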