This paper concerns the central issues of model robustness and sample efficiency in offline reinforcement learning (RL), which aims to learn decision making from historical data without active exploration. Due to uncertainties and variabilities of the environment, it is critical to learn, with as few samples as possible, a robust policy that performs well even when the deployed environment deviates from the nominal one used to collect the historical dataset. We consider a distributionally robust formulation of offline RL, focusing on a tabular non-stationary finite-horizon robust Markov decision process with an uncertainty set specified by the Kullback-Leibler (KL) divergence. To combat sample scarcity, we propose a model-based algorithm that combines distributionally robust value iteration with the principle of pessimism in the face of uncertainty, by penalizing the robust value estimates with a carefully designed data-driven penalty term. Under a mild and tailored assumption on the historical dataset that measures distribution shift without requiring full coverage of the state-action space, we establish the finite-sample complexity of the proposed algorithm, and further show that it is almost unimprovable in light of a nearly-matching information-theoretic lower bound up to a polynomial factor of the horizon length. To the best of our knowledge, this provides the first provably near-optimal robust offline RL algorithm that learns under model uncertainty and partial coverage.
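To illustrate the type of update involved, the following is a minimal sketch, not drawn verbatim from the paper: a distributionally robust Bellman backup over a KL ball of radius $\sigma$ around an empirical nominal kernel $\widehat{P}^0_{h,s,a}$, evaluated through the standard dual form of KL-constrained expectations, and then made pessimistic through a data-driven penalty. The symbols $\sigma$, $\widehat{P}^0_{h,s,a}$, and $b_h(s,a)$ are illustrative placeholders rather than the paper's notation.
\begin{align*}
\inf_{P:\,\mathrm{KL}(P \,\|\, \widehat{P}^0_{h,s,a}) \le \sigma} \mathbb{E}_{s'\sim P}\big[\widehat{V}_{h+1}(s')\big]
 &= \sup_{\lambda \ge 0}\Big\{-\lambda \log \mathbb{E}_{s'\sim \widehat{P}^0_{h,s,a}}\big[e^{-\widehat{V}_{h+1}(s')/\lambda}\big] - \lambda\sigma\Big\},\\
\widehat{Q}_h(s,a) &= \max\Big\{r_h(s,a) + \inf_{P:\,\mathrm{KL}(P\,\|\,\widehat{P}^0_{h,s,a})\le \sigma} \mathbb{E}_{s'\sim P}\big[\widehat{V}_{h+1}(s')\big] - b_h(s,a),\ 0\Big\},\\
\widehat{V}_h(s) &= \max_{a}\,\widehat{Q}_h(s,a).
\end{align*}
Here the duality converts the inner minimization over transition kernels into a one-dimensional optimization over $\lambda \ge 0$ that depends only on the empirical nominal model, while subtracting the penalty $b_h(s,a)$ before truncating at zero implements pessimism in the face of limited data coverage.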