Learning effective policies for sparse objectives is a key challenge in Deep Reinforcement Learning (RL). A common approach is to design task-related dense rewards to improve task learnability. While such rewards are easily interpreted, they rely on heuristics and domain expertise. Alternative approaches that train neural networks to discover dense surrogate rewards avoid heuristics, but are high-dimensional, black-box solutions offering little interpretability. In this paper, we present a method that discovers dense rewards in the form of low-dimensional symbolic trees, making them more tractable for analysis. The trees use simple functional operators to map an agent's observations to a scalar reward, which then supervises the policy gradient learning of a neural network policy. We test our method on continuous action spaces in MuJoCo and discrete action spaces in Atari and Pygame environments. We show that the discovered dense rewards are an effective signal for an RL policy to solve the benchmark tasks. Notably, we significantly outperform a widely used, contemporary neural-network-based reward-discovery algorithm in all environments considered.
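To make the core idea concrete, the sketch below shows one possible way a symbolic reward tree could map an observation vector to a scalar dense reward through simple functional operators. The node classes, operator set, and example tree are illustrative assumptions for exposition, not the paper's actual implementation or search procedure.

```python
# Minimal sketch (illustrative, not the paper's implementation): a symbolic
# reward tree built from simple functional operators that maps an agent's
# observation vector to a scalar dense reward.
from dataclasses import dataclass
from typing import Callable, List, Union

@dataclass
class Var:
    """Leaf node: selects one dimension of the observation vector."""
    index: int
    def evaluate(self, obs: List[float]) -> float:
        return obs[self.index]

@dataclass
class Const:
    """Leaf node: a fixed constant."""
    value: float
    def evaluate(self, obs: List[float]) -> float:
        return self.value

@dataclass
class Op:
    """Internal node: applies a simple functional operator to child values."""
    fn: Callable[..., float]
    children: List["Node"]
    def evaluate(self, obs: List[float]) -> float:
        return self.fn(*(child.evaluate(obs) for child in self.children))

Node = Union[Var, Const, Op]

# Hypothetical example tree: reward = obs[0] - 0.1 * |obs[1]|
# (e.g., reward a position term while penalising a velocity magnitude).
reward_tree: Node = Op(
    fn=lambda a, b: a - b,
    children=[
        Var(0),
        Op(fn=lambda x, c: c * abs(x), children=[Var(1), Const(0.1)]),
    ],
)

if __name__ == "__main__":
    observation = [1.5, -2.0, 0.3]
    dense_reward = reward_tree.evaluate(observation)
    print(f"dense reward: {dense_reward:.3f}")  # 1.5 - 0.1 * 2.0 = 1.300
```

In a training loop, this scalar would replace (or supplement) the sparse environment reward when computing policy gradient updates for the neural network policy.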