We consider iterative learning control, or model-based policy learning, in the presence of uncertain, time-varying dynamics. For this setting, we propose a new performance metric, planning regret, which replaces the standard stochastic uncertainty assumptions with worst-case regret. Building on recent advances in non-stochastic control, we design a new iterative algorithm for minimizing planning regret that is more robust to model mismatch and uncertainty. We provide theoretical and empirical evidence that the proposed algorithm outperforms existing methods on several benchmarks.