Reinforcement learning (RL) has made considerable advances in solving a single problem in a given environment, but learning policies that generalize to unseen variations of a problem remains challenging. To improve sample efficiency when learning on such instances of a problem domain, we present Self-Paced Context Evaluation (SPaCE). Based on self-paced learning, SPaCE automatically generates task curricula online with little computational overhead. To this end, SPaCE leverages information contained in state values during training to accelerate and improve training performance as well as generalization capabilities to new instances from the same problem domain. Nevertheless, SPaCE is independent of the problem domain at hand and can be applied on top of any RL agent with state-value function approximation. We demonstrate SPaCE's ability to speed up learning of different value-based RL agents on two environments, showing better generalization capabilities and up to 10x faster learning compared to naive approaches such as round robin, or SPDRL, the closest state-of-the-art approach.
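To make the curriculum mechanism described above more concrete, the following Python sketch shows one possible way a self-paced instance curriculum could be driven by state-value estimates. It is only an illustration under assumptions, not the paper's exact algorithm: the helpers `value_fn` and `train_step`, the growth threshold, and the rule for widening the curriculum are all hypothetical placeholders.

```python
# Illustrative sketch of a self-paced instance curriculum driven by state-value
# estimates, in the spirit of SPaCE. Not the authors' reference implementation:
# `value_fn`, `train_step`, and the growth rule are assumptions for this example.
from typing import Callable, List, Sequence, Tuple


def select_curriculum(
    instances: Sequence[object],
    value_fn: Callable[[object], float],
    prev_values: List[float],
    size: int,
) -> Tuple[List[object], List[float]]:
    """Pick the `size` instances whose value estimates changed the least."""
    values = [value_fn(inst) for inst in instances]
    # First round: no previous estimates yet, so rank by the raw values instead.
    deltas = (
        [abs(v - p) for v, p in zip(values, prev_values)]
        if prev_values
        else list(values)
    )
    order = sorted(range(len(instances)), key=lambda i: deltas[i])
    return [instances[i] for i in order[:size]], values


def self_paced_training(
    instances: Sequence[object],
    value_fn: Callable[[object], float],
    train_step: Callable[[object], None],
    rounds: int = 100,
    grow_threshold: float = 0.05,
) -> None:
    """Alternate training and value evaluation; grow the curriculum once the
    value estimates on the current instance set have stabilized."""
    size = 1
    prev_values: List[float] = []
    for _ in range(rounds):
        curriculum, values = select_curriculum(instances, value_fn, prev_values, size)
        for inst in curriculum:
            train_step(inst)  # one update of the value-based agent on `inst`
        if prev_values:
            mean_delta = sum(
                abs(v - p) for v, p in zip(values, prev_values)
            ) / len(values)
            # Small change in value estimates -> current set is (roughly)
            # mastered, so add one more instance to the curriculum.
            if mean_delta < grow_threshold and size < len(instances):
                size += 1
        prev_values = values
```

In this sketch the curriculum starts with a single instance and only grows once the agent's value estimates over the full instance set stop changing, which mirrors the abstract's claim that state values collected during training are reused to pace the curriculum with little extra computation.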