Model-based Reinforcement Learning (MBRL) holds promise for data efficiency by planning with model-generated experience in addition to learning from experience gathered in the environment. However, in complex or changing environments, the models used in MBRL will inevitably be imperfect, and their detrimental effects on learning can be difficult to mitigate. In this work, we question whether the objective of these models should be the accurate simulation of environment dynamics at all. We focus our investigations on Dyna-style planning in a prediction setting. First, we highlight and support three motivating points: a perfectly accurate model of environment dynamics is not practically achievable, is not necessary, and is not always the most useful anyway. Second, we introduce a meta-learning algorithm for training models with a focus on their usefulness to the learner rather than their accuracy in modelling the environment. Our experiments show that in a simple non-stationary environment, our algorithm enables faster learning than even an accurate model built with domain-specific knowledge of the non-stationarity.
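As background for the setting the abstract refers to, the following is a minimal sketch of Dyna-style planning in a prediction (policy-evaluation) setting: a tabular TD(0) learner that also maintains a simple last-visit model and replays model-simulated transitions as extra updates. This is a generic illustration of classic Dyna, not the paper's meta-learning algorithm; the `env`/`policy` interfaces and all parameter names are assumptions for the sketch.

```python
import random
from collections import defaultdict

def dyna_td0_prediction(env, policy, episodes=100, alpha=0.1,
                        gamma=0.95, planning_steps=10):
    """Tabular Dyna-style TD(0) for policy evaluation (a generic sketch,
    not the paper's meta-learned model).

    The model simply stores the last observed (reward, next_state, done)
    for each state; planning replays model-simulated transitions with the
    same TD(0) update used for real experience. Assumes `env` exposes
    reset() -> state and step(action) -> (next_state, reward, done), and
    that `policy(state)` returns an action.
    """
    V = defaultdict(float)   # value estimates for the fixed policy
    model = {}               # state -> (reward, next_state, done)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s2, r, done = env.step(policy(s))
            # Learn from real environment experience.
            V[s] += alpha * (r + gamma * V[s2] * (not done) - V[s])
            # Update the (deterministic, last-visit) model.
            model[s] = (r, s2, done)
            # Plan: extra updates from model-generated experience.
            for _ in range(planning_steps):
                sp = random.choice(list(model))
                rp, sp2, dp = model[sp]
                V[sp] += alpha * (rp + gamma * V[sp2] * (not dp) - V[sp])
            s = s2
    return V
```

Note that the planning loop applies the identical update rule to simulated transitions; this is why an inaccurate model directly biases the learned values, which motivates the paper's question of what the model's training objective should be.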