We propose Algorithm Distillation (AD), a method for distilling reinforcement learning (RL) algorithms into neural networks by modeling their training histories with a causal sequence model. Algorithm Distillation treats learning to reinforcement learn as an across-episode sequential prediction problem. A dataset of learning histories is generated by a source RL algorithm, and then a causal transformer is trained by autoregressively predicting actions given their preceding learning histories as context. Unlike sequential policy prediction architectures that distill post-learning or expert sequences, AD is able to improve its policy entirely in-context without updating its network parameters. We demonstrate that AD can reinforcement learn in-context in a variety of environments with sparse rewards, combinatorial task structure, and pixel-based observations, and find that AD learns a more data-efficient RL algorithm than the one that generated the source data.
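To make the training objective concrete, below is a minimal sketch, not the authors' implementation, of how a causal transformer could be trained to autoregressively predict actions from the learning history that precedes them. The per-timestep token layout (observation features, one-hot previous action, previous reward), the names `ADTransformer` and `ad_training_step`, and all hyperparameters are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn


class ADTransformer(nn.Module):
    """Causal transformer mapping a history of (observation, previous action,
    previous reward) tokens to logits over the next action."""

    def __init__(self, obs_dim: int, num_actions: int, d_model: int = 128,
                 n_layers: int = 4, n_heads: int = 4, max_len: int = 1024):
        super().__init__()
        # Per-timestep token: observation features, one-hot previous action,
        # and previous scalar reward (an assumed, simplified token layout).
        self.embed = nn.Linear(obs_dim + num_actions + 1, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, num_actions)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, context_len, obs_dim + num_actions + 1)
        T = tokens.shape[1]
        # Causal mask so each position attends only to earlier history steps.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        h = self.encoder(self.embed(tokens) + self.pos[:, :T], mask=mask)
        return self.action_head(h)  # action logits at every history step


def ad_training_step(model, optimizer, tokens, target_actions):
    """One autoregressive prediction step on a slice of a learning history.

    tokens:         (batch, context_len, token_dim) generated by the source
                    RL algorithm during its own training
    target_actions: (batch, context_len) long tensor of the actions it took
    """
    logits = model(tokens)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.shape[-1]), target_actions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key design choice this sketch reflects is that the context window spans many episodes of the source algorithm's training run, not a single expert trajectory: to predict later actions well, the model has to capture how the source policy improves over the course of the history, which is what lets it keep improving in-context at evaluation time without any parameter updates.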