Recently, methods such as Decision Transformer that reduce reinforcement learning to a prediction task and solve it via supervised learning (RvS) have become popular due to their simplicity, robustness to hyperparameters, and strong overall performance on offline RL tasks. However, simply conditioning a probabilistic model on a desired return and taking the predicted action can fail dramatically in stochastic environments, since trajectories that achieve a given return may have done so only by luck. In this work, we describe the limitations of RvS approaches in stochastic environments and propose a solution. Rather than conditioning on the return of a single trajectory, as is standard practice, our proposed method, ESPER, learns to cluster trajectories and conditions on average cluster returns, which are independent of environment stochasticity. Doing so allows ESPER to achieve strong alignment between target return and expected performance in real environments. We demonstrate this on several challenging stochastic offline-RL tasks, including the puzzle game 2048 and Connect Four against a stochastic opponent. In all tested domains, ESPER achieves significantly better alignment between the target return and the achieved return than simply conditioning on returns, and it also attains higher maximum performance than the value-based baselines.
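To make the relabeling idea concrete, the sketch below shows one simplified way to replace each trajectory's own return with the average return of the cluster it falls into, which would then serve as the conditioning target for an RvS-style policy. This is only an illustration under stated assumptions: k-means over hand-picked trajectory features is used as a stand-in for ESPER's learned clustering (the actual method learns clusters that are independent of environment stochasticity, which k-means does not guarantee), and the function name and feature representation are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def relabel_with_cluster_returns(trajectory_features, trajectory_returns,
                                 n_clusters=8, seed=0):
    """Illustrative sketch: condition on cluster-average returns instead of
    per-trajectory returns. KMeans is a placeholder for a learned clustering."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    cluster_ids = km.fit_predict(trajectory_features)

    targets = np.empty_like(trajectory_returns, dtype=float)
    for c in range(n_clusters):
        mask = cluster_ids == c
        if mask.any():
            # Every trajectory in the cluster gets the cluster's mean return
            # as its conditioning target.
            targets[mask] = trajectory_returns[mask].mean()
    return targets


# Toy usage: 100 trajectories summarized by 16-dimensional feature vectors.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))
rets = rng.normal(loc=1.0, scale=0.5, size=100)
cond_targets = relabel_with_cluster_returns(feats, rets)
# `cond_targets` would replace the raw returns when training the
# return-conditioned policy.
```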