When transferring a control policy from simulation to a physical system, the policy needs to be robust to variations in the dynamics to perform well. Commonly, the optimal policy overfits to the approximate model and the corresponding state distribution, often resulting in failure to transfer under distribution shifts. In this paper, we present Robust Fitted Value Iteration, which uses dynamic programming to compute the optimal value function on the compact state domain and incorporates adversarial perturbations of the system dynamics. The adversarial perturbations encourage an optimal policy that is robust to changes in the dynamics. Utilizing the continuous-time perspective of reinforcement learning, we derive the optimal perturbations for the states, actions, observations, and model parameters in closed form. Notably, the resulting algorithm does not require discretization of states or actions. Therefore, the optimal adversarial perturbations can be efficiently incorporated into the min-max value function update. We apply the resulting algorithm to the physical Furuta pendulum and cartpole. By changing the masses of the systems, we evaluate the quantitative and qualitative performance across different model parameters. We show that robust value iteration is more robust than deep reinforcement learning algorithms and the non-robust version of the algorithm. Videos of the experiments are available at https://sites.google.com/view/rfvi
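To make the min-max value update concrete, the following is a minimal Python sketch of a single robust backup step. It is an illustration under simplifying assumptions, not the paper's implementation: the helper names (worst_case_state_perturbation, robust_value_target), the Euler discretization, and the epsilon-ball constraint are hypothetical, whereas the paper works directly in continuous time and derives separate closed-form perturbations for states, actions, observations, and model parameters. The sketch only shows the state-perturbation case, where the adversary pushes the dynamics along the negative value gradient.

```python
import numpy as np

def worst_case_state_perturbation(grad_v, eps):
    """Closed-form adversarial perturbation within an eps-ball (sketch).

    The adversary shifts the dynamics drift along the negative value
    gradient, which minimizes the value change dV/dt = grad_v . (f + delta).
    """
    norm = np.linalg.norm(grad_v) + 1e-8  # avoid division by zero
    return -eps * grad_v / norm

def robust_value_target(x, u, f, reward, grad_v, v_next, eps, gamma=0.99, dt=0.02):
    """One min-max backup: evaluate the value target under worst-case dynamics.

    f(x, u)     -- nominal dynamics drift (user-supplied)
    reward(x,u) -- instantaneous reward (user-supplied)
    grad_v      -- gradient of the current value estimate at x
    v_next      -- current value-function approximation
    """
    delta = worst_case_state_perturbation(grad_v, eps)
    x_next = x + dt * (f(x, u) + delta)  # worst-case Euler step (assumption)
    return reward(x, u) * dt + gamma * v_next(x_next)
```

Because the adversarial perturbation is available in closed form, the inner minimization adds only a gradient evaluation per backup, so no discretization of states or actions is needed to solve the min-max problem.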