Offline reinforcement learning is important in domains such as medicine, economics, and e-commerce where online experimentation is costly, dangerous, or unethical, and where the true model is unknown. However, most methods assume that all covariates used in the behavior policy's action decisions are observed. This untestable assumption may be incorrect. We study robust policy evaluation and policy optimization in the presence of unobserved confounders. We assume the extent of possible unobserved confounding can be bounded by a sensitivity model, and that the unobserved confounders are sequentially exogenous. We propose and analyze an (orthogonalized) robust fitted-Q-iteration that uses closed-form solutions of the robust Bellman operator to derive a loss-minimization problem for the robust Q function. Our algorithm enjoys the computational ease of fitted-Q-iteration and the statistical improvements (reduced dependence on quantile estimation error) from orthogonalization. We provide sample complexity bounds and insights, and demonstrate effectiveness in simulations.
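To make the quantile-based closed form concrete, below is a minimal sketch of one possible robust fitted-Q-iteration loop, not the paper's implementation. It assumes a marginal-sensitivity-style bound `Lambda` on unobserved confounding, discrete actions, and that the adversarial reweighting is applied to the full target `r + gamma * V(s')` for simplicity; the function name `robust_fqi` and the use of scikit-learn gradient-boosted regressors (with pinball loss for the quantile step) are illustrative choices, not the authors' estimators.

```python
# Hedged sketch: robust fitted-Q-iteration under a sensitivity bound Lambda.
# NOT the paper's exact algorithm; function approximators and the reweighting
# of the full Bellman target are simplifying assumptions for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def robust_fqi(s, a, r, s_next, action_space, Lambda=2.0, gamma=0.99, n_iters=50):
    """Return a regressor approximating a robust (worst-case) Q function."""
    tau = 1.0 / (1.0 + Lambda)                        # quantile level implied by Lambda
    sa = np.column_stack([s, a])
    q_model = GradientBoostingRegressor().fit(sa, r)  # warm start with the reward

    for _ in range(n_iters):
        # Greedy next-state value under the current Q estimate.
        v_next = np.max(
            [q_model.predict(np.column_stack([s_next, np.full(len(s_next), act)]))
             for act in action_space],
            axis=0,
        )
        y = r + gamma * v_next

        # Step 1: conditional tau-quantile of the target, fit with pinball loss.
        quant_model = GradientBoostingRegressor(loss="quantile", alpha=tau).fit(sa, y)
        q_tau = quant_model.predict(sa)

        # Step 2: closed-form adversarial reweighting -- weight targets below the
        # quantile by Lambda and those above by 1/Lambda, so the weights average
        # to one and realize the worst case permitted by the sensitivity bound.
        y_robust = np.where(y <= q_tau, Lambda * y, y / Lambda)

        # Step 3: ordinary regression of the reweighted target yields the next
        # robust Q estimate (the loss-minimization form of the robust update).
        q_model = GradientBoostingRegressor().fit(sa, y_robust)

    return q_model
```

The two-stage structure (quantile fit, then regression on the reweighted target) is what makes an orthogonalized variant attractive: debiasing the regression step can reduce the sensitivity of the final Q estimate to errors in the quantile fit, which this plug-in sketch does not attempt.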