Reinforcement Learning has drawn considerable interest as a tool for solving optimal control problems. Solving a given problem (task or environment) amounts to converging towards an optimal policy. However, multiple optimal policies may exist, and they can differ dramatically in their behaviour; for example, some may be faster than others, but at the expense of greater risk. We consider and study the distribution of optimal policies. We design a curiosity-augmented Metropolis algorithm (CAMEO) that samples optimal policies which adopt effectively diverse behaviours, thereby providing greater coverage of the different possible optimal policies. In experimental simulations we show that CAMEO indeed obtains policies that all solve classic control problems, even in the challenging case of environments with sparse rewards. We further show that the sampled policies exhibit different risk profiles, which points to interesting practical applications in interpretability and represents a first step towards learning the distribution of optimal policies itself.
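For intuition, the following is a minimal, hypothetical sketch of how policy parameters could be sampled with a Metropolis accept/reject rule augmented by a curiosity term; the unnormalized target density (return plus curiosity bonus, tempered by a temperature) and the function names `estimated_return` and `curiosity_bonus` are illustrative assumptions, not the exact construction used by CAMEO.

```python
import numpy as np

def log_target(theta, estimated_return, curiosity_bonus, temperature=1.0):
    """Unnormalized log-density favouring high-return (near-optimal) policies,
    augmented with a curiosity term that rewards behavioural novelty.
    (Illustrative assumption, not the paper's exact objective.)"""
    return (estimated_return(theta) + curiosity_bonus(theta)) / temperature

def metropolis_policies(theta0, estimated_return, curiosity_bonus,
                        n_samples=1000, step_size=0.1, rng=None):
    """Draw a chain of policy-parameter samples with the Metropolis rule."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    log_p = log_target(theta, estimated_return, curiosity_bonus)
    samples = []
    for _ in range(n_samples):
        # Symmetric Gaussian proposal, so the Hastings correction cancels.
        proposal = theta + step_size * rng.standard_normal(theta.shape)
        log_p_new = log_target(proposal, estimated_return, curiosity_bonus)
        # Standard Metropolis acceptance test on the log scale.
        if np.log(rng.uniform()) < log_p_new - log_p:
            theta, log_p = proposal, log_p_new
        samples.append(theta.copy())
    return samples
```

In such a sketch, higher-return policies are visited more often, while the curiosity term keeps the chain from collapsing onto a single behaviour, so the retained samples span behaviourally diverse near-optimal policies.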