Applying reinforcement learning (RL) methods to robots typically involves training a policy in simulation and deploying it on a robot in the real world. Because of the model mismatch between the real world and the simulator, RL agents deployed in this manner tend to perform suboptimally. To tackle this problem, researchers have developed robust policy learning algorithms that rely on synthetic noise disturbances. However, such methods do not guarantee performance in the target environment. We propose a convex risk minimization algorithm to estimate the model mismatch between the simulator and the target domain using trajectory data from both environments. We show that this estimator can be used together with the simulator to evaluate the performance of an RL agent in the target domain, effectively bridging the gap between the two environments. We also show that the convergence rate of our estimator is of the order of $n^{-1/4}$, where $n$ is the number of training samples. In simulation, we demonstrate how our method effectively approximates and evaluates performance across a range of policies in the Gridworld, Cartpole, and Reacher environments. We also show that our method is able to estimate the performance of a 7-DOF robotic arm using the simulator and remotely collected data from the robot in the real world.
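To make the high-level idea concrete, the sketch below shows one possible instantiation of mismatch estimation via convex risk minimization: a logistic classifier trained to distinguish simulator transitions from real-world transitions (the logistic loss is convex, and its minimizer recovers a transition-level density ratio), whose output is then used to reweight simulator rollouts when evaluating a policy's target-domain return. The function names (`fit_mismatch_estimator`, `mismatch_ratio`, `evaluate_policy_in_target`), the feature representation, and the trajectory-level weighting scheme are illustrative assumptions, not the paper's exact estimator.

```python
# Hypothetical sketch: estimate a transition-level mismatch ratio with a
# convex (logistic) risk, then reweight simulator rollouts to evaluate a
# policy's return in the target domain. Assumes transitions are encoded as
# fixed-length (s, a, s') feature vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_mismatch_estimator(sim_transitions, real_transitions):
    """Fit a logistic classifier on (s, a, s') features from both domains.

    Minimizing the logistic loss is a convex risk minimization problem whose
    optimum yields the density ratio p_real / p_sim through the logit link.
    """
    X = np.vstack([sim_transitions, real_transitions])
    y = np.concatenate([np.zeros(len(sim_transitions)),
                        np.ones(len(real_transitions))])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf


def mismatch_ratio(clf, transitions):
    """Convert classifier probabilities into importance ratios p_real / p_sim."""
    p_real = clf.predict_proba(transitions)[:, 1]
    return p_real / np.clip(1.0 - p_real, 1e-6, None)


def evaluate_policy_in_target(clf, sim_rollouts, gamma=0.99):
    """Estimate the target-domain return of a policy from simulator rollouts.

    Each rollout is a list of (transition_features, reward) pairs; per-step
    ratios are accumulated multiplicatively along the trajectory so that
    simulated returns are corrected toward the target dynamics.
    """
    returns = []
    for rollout in sim_rollouts:
        feats = np.array([t for t, _ in rollout])
        rewards = np.array([r for _, r in rollout])
        weights = np.cumprod(mismatch_ratio(clf, feats))  # trajectory correction
        discounts = gamma ** np.arange(len(rewards))
        returns.append(np.sum(weights * discounts * rewards))
    return float(np.mean(returns))
```

Under these assumptions, one would roll out the candidate policy only in the simulator, collect a modest batch of real-world trajectories once, and reuse the fitted classifier to score any number of policies without further robot time.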