Robust Policy Search is the problem of learning policies that do not degrade in performance when subjected to unseen environment model parameters. It is particularly relevant for transferring policies learned in a simulation environment to the real world. Several existing approaches involve sampling large batches of trajectories that reflect the differences across possible environments, and then selecting a subset of these trajectories, for example those with the worst performance, to learn robust policies. We propose an active-learning-based framework, EffAcTS, that selectively chooses model parameters for this purpose, so as to collect only as much data as is necessary to identify such a subset. We instantiate this framework using Linear Bandits, and experimentally validate the sample-efficiency gains and the performance of our approach on standard continuous control tasks. We also present a Multi-Task Learning perspective on the problem of Robust Policy Search, and draw connections from our proposed framework to existing work on Multi-Task Learning.
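To make the active-selection idea concrete, below is a minimal sketch, not the paper's exact algorithm, of how a linear bandit could choose environment model parameters whose rollouts are expected to yield the worst returns. The feature map, the lower-confidence-bound acquisition rule, and the `rollout_return` helper are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def rollout_return(params):
    # Hypothetical stand-in: run the current policy in a simulator
    # configured with `params` and return the episode return.
    raise NotImplementedError

def features(params):
    # Assumed feature map: raw model parameters plus a bias term.
    return np.append(params, 1.0)

def select_worst_case_params(candidate_params, n_rounds, alpha=1.0, lam=1.0):
    """LinUCB-style loop that favors parameters plausibly giving LOW return,
    so trajectories for robust training are collected only where needed."""
    d = features(candidate_params[0]).shape[0]
    A = lam * np.eye(d)   # ridge-regularized design matrix
    b = np.zeros(d)       # accumulated return-weighted features
    chosen = []
    for _ in range(n_rounds):
        theta = np.linalg.solve(A, b)  # current linear estimate of return
        scores = []
        for p in candidate_params:
            x = features(p)
            # Lower confidence bound: estimate minus exploration width.
            width = alpha * np.sqrt(x @ np.linalg.solve(A, x))
            scores.append(theta @ x - width)
        p_star = candidate_params[int(np.argmin(scores))]
        r = rollout_return(p_star)     # collect one trajectory at p_star
        x = features(p_star)
        A += np.outer(x, x)
        b += r * x
        chosen.append((p_star, r))
    return chosen
```

The chosen (parameter, return) pairs would then feed whatever robust policy update the surrounding method prescribes, e.g., training on the worst-performing subset.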