In offline reinforcement learning (RL), we seek to utilize offline data to evaluate (or learn) policies in scenarios where the data are collected from a distribution that substantially differs from that of the target policy to be evaluated. Recent theoretical advances have shown that such sample-efficient offline RL is indeed possible provided certain strong representational conditions hold; otherwise, lower bounds exhibit exponential error amplification (in the problem horizon) unless the data collection distribution has only a mild distribution shift relative to the target policy. This work studies these issues from an empirical perspective to gauge how stable offline RL methods are. In particular, our methodology explores these ideas when using features from pre-trained neural networks, in the hope that these representations are powerful enough to permit sample-efficient offline RL. Through extensive experiments on a range of tasks, we see that substantial error amplification does occur even when using such pre-trained representations (trained on the same task itself); we find offline RL is stable only under extremely mild distribution shift. The implications of these results, from both a theoretical and an empirical perspective, are that successful offline RL (where we seek to go beyond the low distribution shift regime) requires substantially stronger conditions than those which suffice for successful supervised learning.
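To make the offline evaluation setup concrete, the following is a minimal sketch of least-squares fitted Q evaluation (FQE) on top of frozen, pre-trained features, the style of offline policy evaluation to which the error-amplification discussion applies. The function name, array layout, and ridge regularization here are illustrative assumptions rather than a statement of the paper's exact protocol; `phi_s` and `phi_next` stand for features produced by a frozen pre-trained network, and the target policy is assumed deterministic for simplicity.

```python
import numpy as np

def fitted_q_evaluation(phi_s, actions, rewards, phi_next, pi_next,
                        horizon, num_actions, reg=1e-5):
    """Least-squares fitted Q evaluation (FQE) with a fixed feature map.

    phi_s    : (N, d) features of visited states (from a frozen pre-trained network)
    actions  : (N,)   integer actions taken by the behavior (data-collection) policy
    rewards  : (N,)   observed rewards
    phi_next : (N, d) features of the successor states
    pi_next  : (N,)   actions the target policy takes at the successor states
    Returns w of shape (horizon + 1, num_actions, d) with Q_h(s, a) ~= w[h, a] @ phi(s).
    """
    n, d = phi_s.shape
    w = np.zeros((horizon + 1, num_actions, d))  # w[horizon] = 0 at the terminal step
    for h in reversed(range(horizon)):
        # Regression target: immediate reward plus next-step value under the target policy.
        q_next = np.einsum("nd,nd->n", phi_next, w[h + 1][pi_next])
        y = rewards + q_next
        for a in range(num_actions):
            mask = actions == a
            x = phi_s[mask]
            # Ridge-regularized least squares: (X^T X + reg * I) w = X^T y.
            w[h, a] = np.linalg.solve(x.T @ x + reg * np.eye(d), x.T @ y[mask])
    return w
```

The target policy's value at an initial state s_0 is then read off as w[0, pi(s_0)] @ phi(s_0). The concern raised in the abstract is how the errors of these per-step regressions compound backward across the horizon when the data-collection distribution differs substantially from the distribution induced by the target policy.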