Most policy search algorithms require thousands of training episodes to find an effective policy, which is often infeasible with a physical robot. This survey article focuses on the extreme other end of the spectrum: how can a robot adapt with only a handful of trials (a dozen) and a few minutes? By analogy with the word "big-data", we refer to this challenge as "micro-data reinforcement learning". We show that a first strategy is to leverage prior knowledge on the policy structure (e.g., dynamic movement primitives), on the policy parameters (e.g., demonstrations), or on the dynamics (e.g., simulators). A second strategy is to create data-driven surrogate models of the expected reward (e.g., Bayesian optimization) or the dynamical model (e.g., model-based policy search), so that the policy optimizer queries the model instead of the real system. Overall, all successful micro-data algorithms combine these two strategies by varying the kind of model and prior knowledge. The current scientific challenges essentially revolve around scaling up to complex robots (e.g., humanoids), designing generic priors, and optimizing the computing time.
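To make the second strategy concrete, the sketch below is a minimal, illustrative example (not taken from the survey) of optimizing a low-dimensional policy under a micro-data budget: a Gaussian-process surrogate of the expected reward is fit to the few episodes already executed, and the optimizer queries the surrogate through an upper-confidence-bound acquisition instead of the real system. The function `episode_reward` is a hypothetical stand-in for one expensive trial on the physical robot, and all other names and settings are assumptions for illustration only.

```python
# Toy sketch of the "surrogate model of the expected reward" strategy:
# fit a GP to a handful of (policy parameters, reward) pairs, then pick
# the next trial by maximizing an upper-confidence-bound acquisition.
import numpy as np

def gp_posterior(X, y, Xq, length_scale=0.3, noise=1e-4):
    """GP regression with a squared-exponential kernel; returns mean, std at Xq."""
    def k(A, B):
        d = A[:, None, :] - B[None, :, :]
        return np.exp(-0.5 * np.sum(d**2, axis=-1) / length_scale**2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks = k(X, Xq)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = 1.0 - np.sum(v**2, axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def episode_reward(theta):
    """Hypothetical stand-in for one (expensive) episode on the physical robot."""
    return -np.sum((theta - 0.6)**2) + 0.01 * np.random.randn()

rng = np.random.default_rng(0)
dim = 2                                    # low-dimensional policy parameters
thetas = rng.uniform(0, 1, size=(5, dim))  # a handful of initial trials
rewards = np.array([episode_reward(t) for t in thetas])

for trial in range(10):                    # micro-data budget: ~15 episodes total
    candidates = rng.uniform(0, 1, size=(2000, dim))
    mu, sigma = gp_posterior(thetas, rewards, candidates)
    ucb = mu + 2.0 * sigma                 # the optimizer queries the model, not the robot
    theta_next = candidates[np.argmax(ucb)]
    rewards = np.append(rewards, episode_reward(theta_next))
    thetas = np.vstack([thetas, theta_next])

print("best policy parameters found:", thetas[np.argmax(rewards)])
```

The same loop structure applies to model-based policy search, except that the surrogate predicts the system dynamics rather than the reward, and long rollouts are simulated through the learned model before each real trial.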