Despite the success of Random Network Distillation (RND) in various domains, it was previously shown to be not discriminative enough to serve as an uncertainty estimator for penalizing out-of-distribution actions in offline reinforcement learning. In this paper, we revisit these results and show that, with a naive choice of conditioning for the RND prior, it becomes infeasible for the actor to effectively minimize the anti-exploration bonus, and that discriminativity is not an issue. We show that this limitation can be avoided with conditioning based on Feature-wise Linear Modulation (FiLM), resulting in a simple and efficient ensemble-free algorithm based on Soft Actor-Critic. We evaluate it on the D4RL benchmark, showing that it achieves performance comparable to ensemble-based methods and outperforms ensemble-free approaches by a wide margin.
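To make the idea concrete, below is a minimal sketch of an RND prior/predictor architecture in which the state features are conditioned on the action via FiLM, together with the resulting anti-exploration bonus. This is not the authors' exact implementation; network widths and the names `FiLMConditionedRND` and `anti_exploration_bonus` are illustrative assumptions.

```python
# Hedged sketch: FiLM-conditioned RND network and anti-exploration bonus.
# Assumptions: hidden sizes, layer counts, and function names are illustrative,
# not taken from the paper's reference implementation.
import torch
import torch.nn as nn


class FiLMConditionedRND(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        # State encoder producing features to be modulated.
        self.state_net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Action encoder producing per-feature scale (gamma) and shift (beta).
        self.film = nn.Linear(action_dim, 2 * hidden_dim)
        self.head = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        h = self.state_net(state)
        gamma, beta = self.film(action).chunk(2, dim=-1)
        h = gamma * h + beta  # Feature-wise Linear Modulation of state features
        return self.head(torch.relu(h))


def anti_exploration_bonus(prior: nn.Module, predictor: nn.Module,
                           state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    # RND bonus: prediction error of the trained predictor against a frozen,
    # randomly initialized prior; it is larger for out-of-distribution actions
    # and can be subtracted from the reward or added to the actor's objective.
    with torch.no_grad():
        target = prior(state, action)
    return (predictor(state, action) - target).pow(2).mean(dim=-1)
```

In this sketch the prior network is kept frozen while the predictor is trained on dataset state-action pairs; the actor is then penalized by the bonus, which stays low on in-distribution actions and grows on out-of-distribution ones.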