Estimating optimal dynamic policies from offline data is a fundamental problem in dynamic decision making. In the context of causal inference, the problem is known as estimating the optimal dynamic treatment regime. Although a plethora of estimation methods exist, constructing confidence intervals for the value of the optimal regime and for the structural parameters associated with it is inherently harder, as it involves non-linear and non-differentiable functionals of unknown quantities that must themselves be estimated. Prior work resorted to subsampling approaches that can degrade the quality of the estimate. We show that a simple softmax approximation to the optimal treatment regime, with an appropriately fast-growing temperature parameter, can achieve valid inference on the truly optimal regime. We illustrate our result for a two-period optimal dynamic regime, though our approach should extend directly to the finite-horizon case. Our work combines techniques from semi-parametric inference and $g$-estimation, together with an appropriate triangular array central limit theorem, as well as a novel analysis of the asymptotic influence and asymptotic bias of softmax approximations.
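To make the softmax idea concrete, the following is a minimal sketch and not the estimator from the paper. It assumes the non-differentiable maximum over treatment values is replaced by a softmax-weighted average, with a sharpness parameter `beta` playing the role of the growing temperature. The function names and the numerical values in `q` are hypothetical and chosen only for illustration.

```python
import math


def softmax_weights(q_values, beta):
    """Softmax weights over q_values with sharpness beta (assumed form).

    Subtracting the maximum before exponentiating avoids overflow.
    """
    m = max(q_values)
    exps = [math.exp(beta * (q - m)) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_value(q_values, beta):
    """Softmax-weighted average of q_values.

    This is a smooth surrogate for max(q_values): it equals the plain
    average at beta = 0 and approaches the maximum as beta grows.
    """
    weights = softmax_weights(q_values, beta)
    return sum(w * q for w, q in zip(weights, q_values))


# Hypothetical per-treatment values for a single decision point.
q = [0.30, 0.50]
for beta in (1.0, 10.0, 100.0):
    print(beta, soft_value(q, beta))
```

The gap between `soft_value(q, beta)` and `max(q)` shrinks as `beta` increases, which illustrates why the approximation bias can be made negligible when the parameter grows quickly enough, while the surrogate stays differentiable for any finite `beta`.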