Many practical applications of reinforcement learning require agents to learn from sparse and delayed rewards, which challenges the agent's ability to attribute its actions to future outcomes. In this paper, we consider the problem formulation of episodic reinforcement learning with trajectory feedback, an extreme case of delayed rewards in which the agent observes only a single reward signal at the end of each trajectory. A popular paradigm for this problem setting is to learn with a designed auxiliary dense reward function, called a proxy reward, instead of the sparse environmental signal. Within this framework, this paper proposes a novel reward redistribution algorithm, randomized return decomposition (RRD), which learns a proxy reward function for episodic reinforcement learning. We establish a surrogate problem through Monte-Carlo sampling that scales least-squares-based reward redistribution to long-horizon tasks. We analyze the surrogate loss function by relating it to existing methods in the literature, which illustrates the algorithmic properties of our approach. In experiments, we extensively evaluate the proposed method on a variety of benchmark tasks with episodic rewards and demonstrate substantial improvements over baseline algorithms.
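To make the surrogate loss concrete, the sketch below shows one plausible instantiation of randomized return decomposition in PyTorch: a learned per-step reward model is regressed so that a rescaled sum of its predictions over a randomly sampled subset of time steps matches the episodic return. This is a minimal sketch under stated assumptions, not the paper's released implementation; the names `RewardModel`, `rrd_loss`, and `subset_size` are illustrative.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Hypothetical per-step proxy reward model r_hat(s, a)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def rrd_loss(model: RewardModel,
             obs: torch.Tensor,               # (T, obs_dim): states of one trajectory
             act: torch.Tensor,               # (T, act_dim): actions of one trajectory
             episodic_return: torch.Tensor,   # scalar trajectory feedback
             subset_size: int = 64) -> torch.Tensor:
    """Monte-Carlo surrogate of the least-squares return-decomposition loss.

    Instead of summing predicted proxy rewards over all T steps, sample a
    random subset of `subset_size` steps and rescale by T / subset_size,
    so the subsampled sum estimates the full predicted return.
    """
    T = obs.shape[0]
    idx = torch.randperm(T)[:subset_size]          # random subset of time steps
    r_hat = model(obs[idx], act[idx])              # predicted proxy rewards on the subset
    est_return = (T / idx.numel()) * r_hat.sum()   # rescaled estimate of the full sum
    return (episodic_return - est_return) ** 2     # squared regression error
```

The rescaling factor keeps the subsampled sum an unbiased estimate of the full predicted return, which is what allows the least-squares regression to be trained on short random subsequences rather than entire long-horizon trajectories.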