This paper concerns the central issues of model robustness and sample efficiency in offline reinforcement learning (RL), which aims to learn to perform decision making from history data without active exploration. Due to the uncertainties and variabilities of the environment, it is critical to learn a robust policy -- with as few samples as possible -- that performs well even when the deployed environment deviates from the nominal one used to collect the history dataset. We consider a distributionally robust formulation of offline RL, focusing on tabular robust Markov decision processes with an uncertainty set specified by the Kullback-Leibler (KL) divergence in both finite-horizon and infinite-horizon settings. To combat sample scarcity, we propose a model-based algorithm that combines distributionally robust value iteration with the principle of pessimism in the face of uncertainty, by penalizing the robust value estimates with a carefully designed data-driven penalty term. Under a mild and tailored assumption on the history dataset that measures distribution shift without requiring full coverage of the state-action space, we establish the finite-sample complexity of the proposed algorithm, and further show that it is almost unimprovable in light of a nearly-matching information-theoretic lower bound, up to a polynomial factor of the (effective) horizon length. To the best of our knowledge, this provides the first provably near-optimal robust offline RL algorithm that learns under model uncertainty and partial coverage.
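For concreteness, a schematic sketch of the kind of update the abstract alludes to, stated in illustrative notation (here $\sigma$ denotes the KL radius of the uncertainty set, $\widehat{P}^{0}_{h,s,a}$ an empirical nominal transition distribution estimated from the history data, and $b_h(s,a)$ a data-driven penalty; the exact penalty and estimator are specified in the main text). By strong duality for KL-constrained distributionally robust optimization, the robust Bellman backup over the uncertainty set admits a one-dimensional dual form, to which the pessimism penalty is then applied:
\[
\inf_{P:\,\mathrm{KL}(P\,\|\,\widehat{P}^{0}_{h,s,a})\le \sigma} \mathbb{E}_{s'\sim P}\big[\widehat{V}_{h+1}(s')\big]
= \sup_{\lambda\ge 0}\Big\{ -\lambda \log \mathbb{E}_{s'\sim \widehat{P}^{0}_{h,s,a}}\big[e^{-\widehat{V}_{h+1}(s')/\lambda}\big] - \lambda\sigma \Big\},
\]
\[
\widehat{Q}_h(s,a) = \max\Big\{ r_h(s,a) + \sup_{\lambda\ge 0}\Big\{ -\lambda \log \mathbb{E}_{s'\sim \widehat{P}^{0}_{h,s,a}}\big[e^{-\widehat{V}_{h+1}(s')/\lambda}\big] - \lambda\sigma \Big\} - b_h(s,a),\ 0\Big\},
\qquad
\widehat{V}_h(s) = \max_a \widehat{Q}_h(s,a).
\]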