Model-Based Reinforcement Learning involves learning a \textit{dynamics model} from data, and then using this model to optimise behaviour, most often with an online \textit{planner}. Much of the recent research along these lines presents a particular set of design choices, spanning problem definition, model learning and planning. Given these multiple simultaneous contributions, it is difficult to evaluate the effect of each design choice in isolation. This paper sets out to disambiguate the role of different design choices for learning dynamics models, by comparing planning with learned models to planning with a ground-truth model -- the simulator. First, we collect a rich dataset from the training sequence of a model-free agent on 5 domains of the DeepMind Control Suite. Second, we train feed-forward dynamics models in a supervised fashion, and evaluate planner performance while varying and analysing different model design choices, including ensembling, stochasticity, multi-step training and timestep size. Besides the quantitative analysis, we describe a set of qualitative findings, rules of thumb, and future research directions for planning with learned dynamics models. Videos of the results are available at https://sites.google.com/view/learning-better-models.