We study reinforcement learning (RL) in settings where observations are high-dimensional, but where an RL agent has access to abstract knowledge about the structure of the state space, as is the case, for example, when a robot is tasked to go to a specific room in a building using observations from its own camera, while having access to the floor plan. We formalize this setting as transfer reinforcement learning from an abstract simulator, which we assume is deterministic (such as a simple model of moving around the floor plan), but which is only required to capture the target domain's latent-state dynamics approximately up to unknown (bounded) perturbations (to account for environment stochasticity). Crucially, we assume no prior knowledge about the structure of observations in the target domain except that they can be used to identify the latent states (but the decoding map is unknown). Under these assumptions, we present an algorithm, called TASID, that learns a robust policy in the target domain, with sample complexity that is polynomial in the horizon, and independent of the number of states, which is not possible without access to some prior knowledge. In synthetic experiments, we verify various properties of our algorithm and show that it empirically outperforms transfer RL algorithms that require access to "full simulators" (i.e., those that also simulate observations).
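To make the setting concrete, below is a minimal sketch in Python of the two objects the abstract describes: a deterministic abstract simulator over latent states (a toy floor plan), and a target environment whose latent dynamics match the simulator only up to a bounded perturbation and which emits high-dimensional observations through an emission map the agent never sees. All names here (`sim_step`, `TargetEnv`, `eps`, `obs_dim`) are hypothetical illustrations of the problem setup, not the paper's API or the TASID algorithm itself.

```python
# Hypothetical sketch of the transfer setting (not the paper's code):
# latent states are cells of a small floor plan, the abstract simulator is
# deterministic, and the target environment follows the same latent dynamics
# up to an eps-bounded perturbation while emitting high-dimensional
# observations through a map unknown to the agent.
import numpy as np

GRID = 5                                        # 5x5 floor plan of latent states
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right


def sim_step(state, action):
    """Deterministic abstract simulator: move on the floor plan; walls clip."""
    r, c = state
    dr, dc = ACTIONS[action]
    return (min(max(r + dr, 0), GRID - 1), min(max(c + dc, 0), GRID - 1))


class TargetEnv:
    """Target domain: latent dynamics agree with sim_step up to an
    eps-bounded perturbation; the agent only ever sees observations."""

    def __init__(self, obs_dim=64, eps=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.eps = eps
        # Emission map from latent states to observations, unknown to the agent.
        self.emission = self.rng.normal(size=(GRID * GRID, obs_dim))
        self.state = (0, 0)

    def _observe(self):
        idx = self.state[0] * GRID + self.state[1]
        # Noisy high-dimensional observation, still decodable to the latent state.
        return self.emission[idx] + 0.1 * self.rng.normal(size=self.emission.shape[1])

    def step(self, action):
        if self.rng.random() < self.eps:        # bounded dynamics perturbation
            action = self.rng.integers(len(ACTIONS))
        self.state = sim_step(self.state, action)
        return self._observe()


env = TargetEnv()
obs = env.step(3)  # the agent sees only the 64-dim observation, never the latent state
```

The one structural assumption mirrored in this sketch is the abstract's decodability condition: each latent state has a distinct emission direction, so a decoder from observations to latent states exists even though the agent is never given it.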