Distributional reinforcement learning~(RL) is a class of state-of-the-art algorithms that estimate the entire distribution of the total return rather than only its expectation. How each return distribution is represented and which distribution divergence is employed are pivotal for the empirical success of distributional RL. In this paper, we propose a new class of \textit{Sinkhorn distributional RL} algorithms that learns a finite set of statistics, i.e., deterministic samples, from each return distribution and then leverages Sinkhorn iterations to evaluate the Sinkhorn distance between the current and target Bellman distributions. Remarkably, the Sinkhorn divergence interpolates between the Wasserstein distance and Maximum Mean Discrepancy~(MMD), which allows our proposed Sinkhorn distributional RL algorithms to find a sweet spot that leverages both the geometry of optimal-transport-based distances and the unbiased gradient estimates of MMD. Finally, experiments on a suite of Atari games reveal the competitive performance of the Sinkhorn distributional RL algorithm compared with existing state-of-the-art algorithms.
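For reference, a minimal formulation of the Sinkhorn divergence invoked above, assuming the standard entropy-regularized optimal-transport setup with a cost function $c$, regularization strength $\varepsilon>0$, and empirical (sample-based) distributions $\hat{\mu}=\frac{1}{n}\sum_i \delta_{x_i}$ and $\hat{\nu}=\frac{1}{m}\sum_j \delta_{y_j}$; the notation here is illustrative and may differ from the paper's:
\begin{align}
\mathcal{W}_{c,\varepsilon}(\hat{\mu},\hat{\nu}) &= \min_{P\in\Pi(\hat{\mu},\hat{\nu})} \langle P, C\rangle + \varepsilon\,\mathrm{KL}\!\left(P \,\|\, \hat{\mu}\otimes\hat{\nu}\right), \\
\overline{\mathcal{W}}_{c,\varepsilon}(\hat{\mu},\hat{\nu}) &= \mathcal{W}_{c,\varepsilon}(\hat{\mu},\hat{\nu}) - \tfrac{1}{2}\,\mathcal{W}_{c,\varepsilon}(\hat{\mu},\hat{\mu}) - \tfrac{1}{2}\,\mathcal{W}_{c,\varepsilon}(\hat{\nu},\hat{\nu}),
\end{align}
where $C_{ij}=c(x_i,y_j)$ is the pairwise cost between samples, $\Pi(\hat{\mu},\hat{\nu})$ is the set of couplings with the prescribed marginals, and the minimization is carried out by Sinkhorn iterations. As $\varepsilon\to 0$, $\overline{\mathcal{W}}_{c,\varepsilon}$ recovers the unregularized optimal-transport cost, while as $\varepsilon\to\infty$ it converges (under suitable conditions on $c$) to an MMD with kernel $-c$; this is the interpolation property referenced above.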