Intuitive Physics Guided Exploration for Sample Efficient Sim2real Transfer

Physics-based reinforcement learning tasks can benefit from simplified physics simulators, which potentially allow near-optimal policies to be learned in simulation. However, such simulators require the latent factors of the associated objects (e.g., mass and friction coefficient) and other environment-specific factors (e.g., wind speed and air density) to be accurately specified; without this, adapting the learned simulation policy to the real environment can take considerable additional learning effort. Because such a complete specification can be impractical, in this paper we instead focus on learning task-specific estimates of latent factors that allow real-world trajectories to be approximated in an ideal simulation environment. Specifically, we propose two new concepts: (a) action grouping, the idea that certain types of actions are closely associated with the estimation of certain latent factors; and (b) partial grounding, the idea that simulating task-specific dynamics may not need precise estimation of all the latent factors. We first introduce intuitive action groupings based on human physics knowledge and experience, which are then used to design novel strategies for interacting with the real environment. Next, we describe how prior knowledge of a task in a given environment can be used to extract the relative importance of different latent factors, and how this can inform partial grounding, which enables efficient learning of the task in an arbitrary environment. We demonstrate our approach on a range of physics-based tasks and show that it achieves superior performance relative to other baselines while using only a limited number of real-world interactions.
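
To make the two concepts concrete, the following is a minimal toy sketch, not the method from the paper. It assumes two hypothetical action types ("push" and "drop"), each tied to one latent factor (friction and restitution), and hypothetical importance weights for a task. Only factors whose importance exceeds a threshold are estimated, by matching simulated trajectories to observed ones over a small candidate grid; the rest keep default values. All names, simulators, weights, and numbers here are illustrative assumptions, and the "real" trajectories are generated by the same toy simulators with hidden parameters.

```python
import numpy as np

G = 9.81  # gravitational acceleration (m/s^2)


def simulate_push(friction, v0=2.0, dt=0.01, steps=100):
    """Toy simulator: a block pushed at speed v0 decelerates under Coulomb friction."""
    x, v = 0.0, v0
    xs = np.empty(steps)
    for i in range(steps):
        v = max(v - friction * G * dt, 0.0)  # the block cannot move backwards
        x += v * dt
        xs[i] = x
    return xs


def simulate_drop(restitution, h0=1.0, bounces=5):
    """Toy simulator: peak heights after each bounce of a dropped ball."""
    return h0 * restitution ** (2 * np.arange(1, bounces + 1))


# Action grouping (assumed): each action type is informative about one latent factor.
ACTION_GROUPS = {
    "push": ("friction", simulate_push, np.linspace(0.05, 0.6, 12)),
    "drop": ("restitution", simulate_drop, np.linspace(0.1, 0.9, 9)),
}

# Hypothetical task-specific importance of each latent factor, plus fallback values.
IMPORTANCE = {"friction": 0.9, "restitution": 0.1}
DEFAULTS = {"friction": 0.2, "restitution": 0.5}


def partially_ground(real_trajs, threshold=0.5):
    """Estimate only the important latent factors; leave the others at their defaults."""
    estimates = dict(DEFAULTS)
    for action, (factor, simulate, candidates) in ACTION_GROUPS.items():
        if IMPORTANCE[factor] < threshold:
            continue  # partial grounding: skip factors the task barely depends on
        errors = [np.mean((simulate(c) - real_trajs[action]) ** 2) for c in candidates]
        estimates[factor] = float(candidates[int(np.argmin(errors))])
    return estimates


# Stand-in for real-world interactions: trajectories from hidden "true" parameters.
real_trajs = {"push": simulate_push(0.3), "drop": simulate_drop(0.7)}
estimates = partially_ground(real_trajs)
```

In this sketch only the "push" trajectory is actually used: friction is estimated from it, while restitution stays at its default because its assumed importance falls below the threshold. This is the sense in which partial grounding can save real-world interactions.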
